Abstract

We made three advances in mass estimation. First, we propose the first adaptive version
of mass estimation, using a new nearest neighbour procedure which runs significantly
faster than existing nearest neighbour procedures and needs no indexing scheme. Second,
we propose the first mass-based Bayesian classifier, which estimates the likelihood
directly in multi-dimensional space, unlike existing Bayesian classifiers, which estimate
simplified surrogates of the likelihood (e.g., one-dimensional likelihoods). Third, we
have created the first mass-based similarity measure and show that it is an effective
alternative to distance-based similarity measures in content-based information retrieval
problems.

In addition, we have extended the two previous works on mass estimation and published
them in Machine Learning Journal and Journal of Knowledge and Information Systems.


1  Introduction 

The  project  deepens  the  research  achieved  in  the  pioneering  mass  estimation  project 
(supported  by  AOARD  Grant  FA2386-10-1-4052)  to  yield  the  next  generation  of  mass 
estimation  and  to  enable  mass  estimation  to  be  applied  more  widely  to  data  mining 
tasks.  The  project  has  been  refined  to  achieve  the  following  specific  aims: 


1.  Create  the  second  generation  of  mass  estimation 

2.  Develop  a  new  Bayesian  Classifier  using  mass  estimation 

3.  Propose  the  first  mass-based  similarity  measure 


In the previous project, mass estimation was established as a new paradigm in data
mining. Like density estimation in the existing paradigm, mass estimation has influence
in many aspects of data mining. We have previously shown that mass-based approaches work
more effectively and efficiently than density-based approaches in four data mining tasks:
information retrieval, regression, anomaly detection and clustering.

The second generation of mass estimation consolidates mass estimation's fundamental role
in data mining in terms of algorithmic design, and demonstrates that the new paradigm
based on mass is applicable and effective in as wide a range of data mining areas as the
existing density paradigm.

This project broadens its application to an additional data mining task: classification
in the Bayesian framework. The resultant mass-based Bayesian classifiers can handle very
large data sets that are infeasible for existing Bayesian classifiers.

The unexpected result of this project is the creation of the mass-based similarity measure.
The new measure is unique because it does not compute distance and it is a non-metric
measure. It could potentially enhance similarity modelling envisaged by psychologists.


The ultimate goal of the work is to produce a complete theory of mass estimation that
enables mining of big data in all data mining tasks. This project has contributed to the
generalisation of the initial formulation of mass estimation and has affirmed its
fundamental data modelling mechanism, which has generic applicability to many areas of
data mining.

We  provide  the  theoretical  analyses  in  Section  2,  the  results  and  discussion  in  Section  3, 
and  the  final  remark  in  Section  4. 


2  Theoretical  Analyses 

This  section  provides  the  description  of  the  theoretical  analyses  for  the  abovementioned 
three  aims  in  the  following  three  subsections. 

2.1  Create  the  second  generation  of  mass  estimation 

We establish the properties of a good mass estimation method to guide the creation of the
second generation of mass estimation. The first generation of mass estimation is based
on a tree structure that partitions the feature space into local regions. The work is
guided by the following questions:

•  What  are  the  characteristics  of  local  regions  necessary  for  good  mass  estimation? 

•  Are  pair-wise  calculations,  as  in  the  case  of  pair-wise  distance  calculations  in  existing 
approaches,  required  to  achieve  good  task-specific  performance? 

•  What  is  the  alternative  to  the  tree-based  approach  for  mass  estimation? 

We explored different ways to partition the space and identified the characteristics of
the local regions required for a specific task. We also created the first non-tree-based
approach for mass estimation.

As most existing algorithms are density based, we have designed our work to evaluate
the efficacy of mass estimation by first building new density estimators based on mass,
and then replacing the density estimators in existing algorithms with the new density
estimators. In each case, the same algorithm is used; only the core modelling mechanism
differs, i.e., either the existing density estimator or the new density estimator based
on mass. We focus our investigation on two existing algorithms in two separate tasks,
i.e., LOF in anomaly detection and DBSCAN in clustering.

2.2  Develop  a  new  Bayesian  Classifier  using  mass  estimation 

We have proposed the first mass-based classifier, a new Bayesian classifier called
MassBayes, which has constant training time complexity and constant space complexity.
Unlike existing Bayesian classifiers, which must make some assumption that allows them to
estimate simplified surrogates of the likelihood p(x|y), the new Bayesian classifier
estimates the likelihood directly without any such assumption. This is made possible by
the use of mass estimation to estimate the likelihood directly in multi-dimensional space.
It aggregates the multi-dimensional likelihoods estimated from random subsets of the
training data using random feature subsets of varying sizes.

The  symbols  used  in  the  following  description  are  provided  in  Table  1. 

In Bayesian classifier learning, estimating the joint probability distribution p(x, y) or
the likelihood p(x|y) directly from training data is considered to be difficult,
especially in large multi-dimensional data sets. In order to circumvent this difficulty,
existing Bayesian classifiers such as Naive Bayes, BayesNet and AηDE have focused on
estimating simplified surrogates of p(x, y) from different forms of one-dimensional
likelihoods.

Naive Bayes (NB) is the simplest generative approach that estimates p(x, y) by assuming
that the attributes are statistically independent given y:

$$ p(\mathbf{x}, y)_{\mathrm{NB}} = p(y) \prod_{i=1}^{d} p(x_i \mid y) \tag{1} $$

Table 1: List of symbols used in Sections 2.2 and 3.2.

Symbol                           Description
D                                Set of training instances
n                                Number of training instances in D
d                                Number of attributes
c                                Number of classes
|·|                              Cardinality of a set
p(·)                             Probability
x                                d-dimensional vector representing a data instance
y                                Class label of a data instance
$x_i$                            The value of the ith attribute of the instance x
$\pi_i$                          The set of parents of $x_i$ in a Bayesian network
$\pi_y$                          The set of parents of y in a Bayesian network
$\eta$                           Number of privileged attributes in AηDE
v                                Average number of discrete values for an attribute
$S_\eta$                         The collection of all subsets of size η of the set of d attributes
s                                A subset of attributes of size η
$\mathbf{x}_a$                   |a|-dimensional vector of values of x defined by a ⊆ {1, ..., d}
$\binom{d}{\eta}$                A binomial coefficient of η out of d
t                                Number of trees in MassBayes
$\psi$                           Sub-sample size
$G_t$                            A collection of t subsets of varying sizes of d attributes
g                                A random subset of attributes
$\mathcal{D}$                    A subset of D of size ψ
T(·)                             A function that divides the feature space into non-overlapping regions
T(x)                             The region where x falls into
$\mathcal{D}_y$                  The set of instances belonging to class y in $\mathcal{D}$
$\mathcal{D}_{y,\mathbf{x}_g}$   The set of instances belonging to class y and having values $\mathbf{x}_g$ in $\mathcal{D}$

BayesNet learns a network of probabilistic relationships among the attributes, including
the class attribute, and learns the conditional probabilities from the training data. The
joint probability p(x, y) is estimated as:

$$ p(\mathbf{x}, y)_{\mathrm{BayesNet}} = p(y \mid \pi_y) \prod_{i=1}^{d} p(x_i \mid \pi_i) \tag{2} $$

In another simplification of BayesNet, AηDE relaxes the independence assumption by
allowing dependency between y and a fixed number of privileged attributes or
super-parents. The other attributes are assumed to be independent given the η
super-parents and y. AηDE avoids the expensive search involved in learning probabilistic
dependencies by constructing an ensemble of η-dependence estimators. The joint
probability p(x, y) is estimated as:

$$ p(\mathbf{x}, y)_{\mathrm{A}\eta\mathrm{DE}} = \sum_{s \in S_\eta} p(\mathbf{x}_s, y) \prod_{j \in \{1, 2, \ldots, d\} \setminus s} p(x_j \mid \mathbf{x}_s, y) \tag{3} $$

where $S_\eta$ is the collection of all subsets of size η of the set of d attributes
{1, 2, ..., d}; and $\mathbf{x}_s$ is an η-dimensional vector of values of x defined by s.

It has been shown that A1DE and A2DE produce better predictive accuracy than the
other state-of-the-art generative classifiers. However, AηDE only allows dependencies on
a fixed number of attributes and y. Because of the high time complexity of
$O\left(n \binom{d}{\eta+1}\right)$ and space complexity of
$O\left(c \binom{d}{\eta+1} v^{\eta+1}\right)$, where v is the average number of values
for an attribute, only A2DE or A3DE is feasible even for a moderate number of attributes.
Furthermore, selecting an appropriate value of η for a particular data set requires a
search.
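
To make the growth concrete, the short sketch below evaluates the leading terms of the standard AηDE costs, n·C(d, η+1) for training time and c·C(d, η+1)·v^(η+1) for the stored probability tables. The function name `ande_cost` and the example values (v = 10 discrete values per attribute, n = 10000) are illustrative assumptions, not figures from the experiments.

```python
from math import comb


def ande_cost(n, d, c, v, eta):
    """Leading terms of the AηDE training time and table size.

    n: training instances, d: attributes, c: classes,
    v: average number of discrete values per attribute,
    eta: number of super-parents.
    """
    subsets = comb(d, eta + 1)           # attribute subsets of size eta + 1
    time_units = n * subsets             # one table update per instance and subset
    table_entries = c * subsets * v ** (eta + 1)
    return time_units, table_entries


if __name__ == "__main__":
    # Hypothetical setting: d = 32 attributes, c = 3 classes, v = 10 values.
    for eta in (1, 2, 3):
        time_units, table_entries = ande_cost(10000, 32, 3, 10, eta)
        print(f"A{eta}DE: time ~ {time_units:,}  table entries ~ {table_entries:,}")
```

Under these assumed values, each extra super-parent multiplies the table size by roughly two orders of magnitude, which is consistent with the memory limits discussed in Section 3.2.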


AηDE and many other implementations of BayesNet require all the attributes to be
discrete. The continuous-valued attributes must be discretised using a discretisation
method before building a classifier.


Rather than aggregating an ensemble of η-dependence single-dimensional likelihood
estimators, we propose to aggregate an ensemble of t multi-dimensional likelihood
estimators, where each likelihood is estimated using a different random subset of the d
attributes. The likelihood p(x|y) is estimated as:

$$ p(\mathbf{x} \mid y) \approx \frac{1}{t} \sum_{g \in G_t} p(\mathbf{x}_g \mid y) \tag{4} $$

where $G_t$ is a collection of t subsets of varying sizes of d attributes; $\mathbf{x}_g$
is a |g|-dimensional vector of values of x defined by g; and 1 ≤ |g| ≤ d.


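Each $p(\mathbf{x}_g \mid y)$ is estimated using a random subset of training instances
$\mathcal{D} \subset D$, where $|\mathcal{D}| = \psi < n$:

$$ p(\mathbf{x}_g \mid y) = \frac{|\mathcal{D}_{y,\mathbf{x}_g}|}{|\mathcal{D}_y|} \tag{5} $$

where $|\mathcal{D}_{y,\mathbf{x}_g}|$ is the number of instances having attribute values
$\mathbf{x}_g$ and belonging to class y in $\mathcal{D}$, and $|\mathcal{D}_y|$ is the
number of instances belonging to class y in $\mathcal{D}$.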
Rather than relying on a specific discretisation method in the preprocessing step, we
propose to build a model directly from data, akin to an adaptive multi-dimensional
histogram, to determine $\mathbf{x}_g$ in a way that adapts to the local data
distribution. Let T(·) be a function that divides the feature space into non-overlapping
regions and T(x) be the region where x falls. In a multi-dimensional space, each instance
in $\mathcal{D}$ can be isolated by splitting on only a few dimensions, i.e., only a
subset of the d attributes (g ⊆ {1, 2, ..., d}) is used to define T(x). Hence,
$|\mathcal{D}_{y,\mathbf{x}_g}|$ is the number of instances belonging to class y in the
region T(x). Let p(T(x)|y) be the probability of region T(x) when only class y instances
in $\mathcal{D}$ are considered:

$$ p(T(\mathbf{x}) \mid y) = p(\mathbf{x}_g \mid y) = \frac{|\mathcal{D}_{y,\mathbf{x}_g}|}{|\mathcal{D}_y|} \tag{6} $$

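The new generative classifier, called MassBayes, estimates the joint distribution as:

$$ p(\mathbf{x}, y)_{\mathrm{MassBayes}} = p(y) \frac{1}{t} \sum_{g \in G_t} p(\mathbf{x}_g \mid y) = p(y) \frac{1}{t} \sum_{i=1}^{t} p(T_i(\mathbf{x}) \mid y) \tag{7} $$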

The average probability of t different regions $T_i(\mathbf{x})$ (i = 1, 2, ..., t), each
constructed using a subsample $\mathcal{D}_i \subset D$, provides a good estimate for
p(x|y) as it estimates the multi-dimensional likelihood by considering the distribution in
different local neighbourhoods of x in the data space. An illustrative example is
provided in Figure 1.


Figure 1: Different regions from different $T_i(\cdot)$ (i = 1, 2, ..., 5) that cover x.
MassBayes has the following characteristics in comparison with AηDE:

1.  In each estimator, AηDE estimates a one-dimensional likelihood given a fixed number
of super-parents and y, whereas MassBayes estimates a multi-dimensional likelihood
using a varying number of attributes.

2.  In AηDE, the ensemble size is fixed to $\binom{d}{\eta}$. MassBayes, in contrast,
lets users set the ensemble size.

3.  AηDE requires continuous-valued attributes to be discretised before building the
model. The performance of AηDE is affected by the discretisation technique used.
In contrast, MassBayes builds models directly from data. It can be viewed as a
dynamic multi-dimensional discretisation where the information loss is minimised
by averaging over multiple models.

4.  Each model in MassBayes is built with a training subset of size ψ < n, which gives
rise to the constant training time. In contrast, each model in AηDE is trained using
the entire training set.

5.  AηDE is a deterministic algorithm whereas MassBayes is a randomised algorithm.

6.  Like AηDE, MassBayes is a generative classifier without search.
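
The averaging in Equations (5) to (7) can be illustrated with a small sketch. It is not the MassBayes implementation: the region function here is a random axis-parallel tree that stops at a hypothetical `min_leaf` size or `max_depth`, the class prior is taken from the full label list, and the names `build_region_tree`, `region_mass` and `MassBayesSketch` are invented for illustration.

```python
import random
from collections import Counter


def build_region_tree(points, labels, rng, min_leaf=4, depth=0, max_depth=12):
    """Random axis-parallel partition; each leaf stores per-class counts (mass)."""
    if len(points) <= min_leaf or depth >= max_depth:
        return {"mass": Counter(labels)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # constant attribute in this region: stop splitting
        return {"mass": Counter(labels)}
    cut = rng.uniform(lo, hi)
    left = [i for i, p in enumerate(points) if p[dim] < cut]
    right = [i for i, p in enumerate(points) if p[dim] >= cut]

    def child(idx):
        return build_region_tree([points[i] for i in idx], [labels[i] for i in idx],
                                 rng, min_leaf, depth + 1, max_depth)

    return {"dim": dim, "cut": cut, "left": child(left), "right": child(right)}


def region_mass(node, x):
    """Per-class counts of the leaf region T(x) that x falls into."""
    while "mass" not in node:
        node = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
    return node["mass"]


class MassBayesSketch:
    def __init__(self, t=50, psi=64, seed=0):
        self.t, self.psi, self.seed = t, psi, seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        n = len(X)
        counts = Counter(y)
        self.classes = sorted(counts)
        # Assumption: p(y) is estimated from all labels.
        self.prior = {c: counts[c] / n for c in self.classes}
        self.models = []
        for _ in range(self.t):
            idx = rng.sample(range(n), min(self.psi, n))  # subsample of size psi
            pts, lbs = [X[i] for i in idx], [y[i] for i in idx]
            self.models.append((build_region_tree(pts, lbs, rng), Counter(lbs)))
        return self

    def joint(self, x):
        """p(y) times the average over t regions of |D_{y,x_g}| / |D_y|."""
        scores = {}
        for c in self.classes:
            total = 0.0
            for tree, class_size in self.models:
                if class_size[c] > 0:
                    total += region_mass(tree, x)[c] / class_size[c]
            scores[c] = self.prior[c] * total / self.t
        return scores

    def predict(self, x):
        scores = self.joint(x)
        return max(scores, key=scores.get)
```

Each tree is built from only `psi` instances, so the cost of building the t trees in this sketch does not grow with n; only the prior estimate touches the full label list.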

2.3  Propose  the  first  mass-based  similarity  measure 

Data mining algorithms have traditionally relied on similarity measures to gauge the
similarity between two instances as the core operation to solve various data mining
problems. For example, anomaly detection requires ranking of instances in a database
according to their degrees of anomaly; an information retrieval task ranks instances in a
database which are most similar to a query. These ranking tasks are traditionally
accomplished by computing the distance between two instances as the key step to calculate
the ranking.

The first mass-based similarity measure is motivated by our recent content-based
multimedia information retrieval (CBMIR) system called ReFeat [8].


ReFeat is unique in two aspects. First, it uses a similarity measure which is primarily
based on data distribution in the local region. In contrast, commonly used distance
measures are solely based on the positions of instances in the feature space. Second, at
the heart of ReFeat is an anomaly detector which provides a ranking score ξ(x) for an
instance x, independently of other instances. This is fundamentally different from most
ranking measures that rely on a distance measure to compute the distance of an instance
relative to another instance, dist(x, y) (e.g., ORCA, Qsim).

The use of such a unique similarity measure is the key reason why ReFeat has produced
better retrieval performance than state-of-the-art CBMIR systems, including the manifold
learning method MRBIR, the Bayesian learning method BALAS, and query-sensitive ranking
methods such as InstRank and Qsim.

Despite its unique approach and demonstrably excellent retrieval performance, the ReFeat
paper [8] does not provide a satisfactory explanation as to why a unary score function
could produce an appropriate ranking of database instances for a query, which requires
a binary function. More to the point: ReFeat does not guarantee that two 'similar'
instances, having a similar ranking score ξ(·), are in the same local neighbourhood.

We investigate the source of the power of ReFeat. Note that the anomaly detector used
in ReFeat is a special case of the mass-based approach, as revealed in our journal paper
on mass estimation [1].

From a foundation in mass estimation, we derive a new mass-based similarity measure
that enables a new CBMIR system to significantly improve on the retrieval performance of
ReFeat.


3  Results  and  Discussion 

This  section  provides  the  results  and  discussion  for  each  of  the  three  aims  in  the  following 
three  subsections. 
3.1  Create  the  second  generation  of  mass  estimation

3.1.1  Properties  of  a  good  mass  estimation  method 

The most successful method for mass estimation thus far is to aggregate mass from
multiple local regions. There are different ways in which local regions can be
constructed: using trees in our previous work [1, 2] and nearest neighbours in the
current project [4]. They all have been shown to perform well in a number of different
data mining tasks. The trees could be built using random splits, equal-volume splits or
equal-data-size splits; and the regions could be grid-based or non-grid-based.

However, certain tasks demand that some properties be met. For example, in clustering
tasks, the local regions must be closed regions, usually defined using all dimensions.
This prevents points which are near in some dimensions but far in others from being
linked into one cluster. On the other hand, in classification tasks, the local regions
must be open so that they cover sufficiently large regions to provide a good estimation
of probabilities in the Bayesian classification context. In anomaly detection tasks,
either open or closed local regions are found to work well.

The mass-based methods show that local regions are sufficient to determine the
"neighbours" required to accomplish classification, clustering and anomaly detection
tasks. This eliminates the need to compute pair-wise distances, and it is the key
contributor to the significant speedup and lower memory requirement of mass-based methods
in comparison with existing methods which require distance calculations.

3.1.2  A  new  approach  to  nearest  neighbour  density  estimator  based  on  the 
second  generation  of  mass  estimation 

The first mass-based approach we have developed under this project advances the
mass-based approaches developed in the previous AOARD supported project in one key
aspect: mass estimation is adaptive to the local data distributions. Previous
mass estimation approaches (published in IEEE ICDM-2010 and IEEE ICDM-2011) are
unable to adapt to differing local data distributions in a single data set.
The new approach produces the first nearest neighbour procedure having O(n) time
complexity and constant space complexity. In contrast, existing nearest neighbour
algorithms have O(n^2) time complexity and O(n) space complexity. It is the first nearest
neighbour density estimator to have linear time complexity and constant space complexity,
as far as we know.

Unlike existing algorithms which require some indexing scheme to speed up the nearest
neighbour search, the new approach has no such requirement. Replacing the nearest
neighbour procedure in existing algorithms with the new procedure, we have shown that
this enables two nearest neighbour algorithms, otherwise infeasible, to scale up to data
sets with millions of instances in anomaly detection and clustering tasks.

The new density estimator, called LiNearN (for linear time nearest neighbour algorithm),
has the following distinctive features:

•  By rejecting the premise that a nearest neighbour algorithm must find the nearest
neighbour for every instance in the given data set, LiNearN finds the nearest
neighbour for every instance in a subsample only.

•  LiNearN defines local neighbourhoods using nearest neighbours in each of many
subsamples by building a region centred at each instance. This differs from the
existing nearest neighbour density estimators, where the local neighbourhoods are
defined based on either k nearest neighbours or a fixed radius.
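
The effect of searching subsamples instead of the full data set can be sketched as follows. This is not LiNearN itself, whose estimator is described in [4]; the scoring rule (mean distance to the nearest point in each subsample) and the name `subsample_nn_scores` are assumptions chosen only to show why the cost is linear in n.

```python
import math
import random


def subsample_nn_scores(data, t=20, psi=16, seed=0):
    """Score each instance by its mean nearest-neighbour distance over t subsamples.

    Every instance is compared with only t * psi points, so the total cost is
    O(n * t * psi): linear in n for fixed t and psi. Larger scores suggest
    sparser neighbourhoods.
    """
    if psi < 2 or len(data) < 2:
        raise ValueError("need at least two points per subsample")
    rng = random.Random(seed)
    k = min(psi, len(data))
    # The subsamples are the only stored model: t * psi points, independent of n.
    subsamples = [rng.sample(data, k) for _ in range(t)]
    scores = []
    for x in data:
        total = 0.0
        for sub in subsamples:
            # Skip x itself if it happens to have been drawn into this subsample.
            total += min(math.dist(x, s) for s in sub if s is not x)
        scores.append(total / t)
    return scores
```

No index structure is built in this sketch; each query simply scans the small subsamples.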

Our asymptotic analysis reveals that the new density estimator has a parameter which
trades off between bias and variance, as in existing density estimators such as k-nearest
neighbour density estimators.

We assess LiNearN in anomaly detection and clustering tasks and compare it with three
state-of-the-art nearest neighbour algorithms: ORCA, LOF and DBSCAN. LiNearN produces
similar results to these algorithms in terms of task-specific performance measures, but
it runs orders of magnitude faster than these algorithms in large data sets. The
advantages of the new nearest neighbour approach shown in these two tasks imply that it
can potentially be adopted, in place of existing nearest neighbour algorithms, to
solve other data mining tasks.


The  full  details  of  the  LiNearN  results  can  be  found  in  the  attached  paper  [4]. 

3.2  Develop  a  new  Bayesian  Classifier  using  mass  estimation 

The new Bayesian classifier based on mass, called MassBayes, has been shown to have
better predictive accuracy than existing state-of-the-art Bayesian classifiers such as
BayesNet and AηDE, especially in large data sets. It also scales better than these
classifiers in large data sets because of its constant training time complexity. Unlike
DEMassBayes [2], the mass-based Bayesian classifier does not make use of a density
estimator to estimate the likelihood.

The average classification accuracies of MassBayes against the existing generative
classifiers and DEMassBayes over a 10-fold cross-validation are plotted in Figure 2. A
total of fifteen data sets were used in this experiment. The coordinates of each point in
the plot are the accuracies of a pair of classifiers in one data set. If both classifiers
had produced the same accuracy in a data set, the point representing that data set would
lie on the diagonal. In both plots, many points lie below the diagonal line and only
a few points are above it. This shows that MassBayes is better than the existing Bayesian
classifiers.

MassBayes ran an order of magnitude faster than some contenders (such as A2DE and
NB-KDE) in large data sets, and it was competitive with many existing Bayesian
classifiers in many data sets. The runtimes of MassBayes and the existing Bayesian
classifiers in the three largest data sets (KDDCup99, YearPrediction and CoverType) are
presented in Figure 3. It is interesting to note that BayesNet, A2DE and A3DE in the
KDDCup99 data set, and BayesNet and A3DE in YearPrediction, did not complete the
task because they required more than 20GB of memory. The memory requirement
of BayesNet and AηDE increases with d and c. ENNBayes is a version of MassBayes
implemented using nearest neighbours; that is why it runs slower than MassBayes.

(a)  MassBayes  versus  BayesNet  and  NB  (b)  MassBayes  versus  DEMassBayes

Figure 2: Scatter plot of accuracies of MassBayes versus those of BayesNet, two variants
of Naive Bayes (NB-KDE and NB-Disc) and DEMassBayes.

Note that the reported runtime results for AηDE, BayesNet and NB-Disc did not include
the discretisation time that must be spent as a preprocessing step, which gives AηDE,
BayesNet and NB-Disc an unfair advantage over MassBayes. The discretisation cost is
substantial in large and moderately high-dimensional data sets. For example,
the discretisation took 1290 seconds in the KDDCup99 data set, and 467 seconds in the
YearPrediction data set. The discretisation time itself was of the same order of magnitude
as the runtime of MassBayes.

Figures 4 and 5 show the increase in training time and in the space required to store the
classification model of MassBayes and three variants of AηDE when the number of training
instances (n) and the number of attributes (d) are increased by factors (5, 10, 50, 100,
500) and (2, 4, 8, 12, 16) respectively. In MassBayes, both the runtime and memory
requirement were constant when the training size was increased, whereas they varied
sub-linearly when the number of attributes was increased. In contrast, the runtime and
memory requirement of AηDE increase exponentially as training size increases. With
respect to the increase in the number of attributes, the runtime and memory requirement
of AηDE also increase exponentially.

The  details  of  the  results  are  reported  in  the  attached  papers  [3,  6]. 
KDDCup99  YearPrediction  CoverType

Figure 3: Runtime of MassBayes, ENNBayes and the other contenders in the three largest
data sets: KDDCup99, CoverType and YearPrediction. The vertical axis is on a logarithmic
scale of base 10. For ease of reading, the classifiers are organised into groups of three:
the first group has three classifiers (MassBayes, ENNBayes and DEMassBayes) based on
the proposed ensemble approach; the second group has three variants of AηDE (A3DE,
A2DE and A1DE); and the last group has BayesNet, NB-KDE and NB-Disc. Note that
the discretisation time was not included in the runtime of AηDE, BayesNet and NB-Disc.
Bars marked with a star, drawn at the maximum height, indicate that the classifiers did
not complete the tasks.
Figure 4: Increase in training time and memory requirement to learn a classification model
with the increase in training size in a subset of the KDDCup99 data set with the three
largest classes (i.e., c = 3, d = 32). The horizontal axes are on a logarithmic scale of
base 10. A3DE did not complete when the training size was increased to five million. The
base of the training size ratio is 10000 instances.


(a)  Training  Time 


(b)  Memory  requirement 


Figure 5: Increase in training time and memory requirement to learn a classification model
with the increase in the number of attributes in a subset of the KDDCup99 data set with
the three largest classes (i.e., c = 3, n = 5125369). The number of attributes is
increased from 2 to 32. The horizontal and vertical axes in panels (a) and (b) are on
logarithmic scales of base 2 and 10, respectively. A3DE did not complete when the number
of attributes was increased to 32.
3.3  Propose  the  first  mass-based  similarity  measure

This section describes the preliminary results for the first mass-based similarity
measure. The key results at this point in time are summarised as follows:

•  Introducing a unique similarity measure, Massim, and establishing its theoretical
foundation based on mass.

•  Creating a new CBMIR system called MassIR based on this similarity measure.

•  Empirically evaluating MassIR in comparison with ReFeat and systems which employ
commonly used similarity measures, and showing its superiority in image and
music information retrieval.

Massim  has  the  following  characteristics: 

•  Unlike  the  similarity  measure  used  in  ReFeat,  Massim  guarantees  that  two  similar 
instances  are  in  the  same  local  neighbourhood. 

•  Unlike distance-based similarity measures, it does not compute distance and is
primarily based on the data distribution in the local region.

•  It  is  a  generalisation  of  mass  estimation.  Under  certain  conditions,  it  reduces  to 
mass  estimation. 

The key result of the evaluation is provided here. The aim is to compare MassIR with
the state-of-the-art systems ReFeat, Qsim, InstRank, MRBIR and BALAS, all in the context
of content-based information retrieval.

Two databases which were previously used in other studies, the GTZAN music database and
the COREL image database, are employed in this experiment.

Figure 6 shows the comparison of MassIR with ReFeat, Qsim, InstRank, MRBIR and
BALAS. Results of Qsim, InstRank, MRBIR and BALAS were taken from [8]. The result
shows that MassIR performs significantly better than all other existing methods. Also
note that the performance gap increases as the number of feedback rounds increases,
indicating that MassIR utilises feedback instances more effectively than other methods.
(a)  GTZAN  music  database  (b)  COREL  image  database

Figure 6: Comparison of retrieval performance in terms of MAP (Mean Average Precision).
The horizontal axis of each panel is the round of relevance feedback.
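
For readers unfamiliar with the metric, a minimal sketch of MAP is given below. It assumes each query's result is a ranked list of 0/1 relevance flags covering every relevant item in the database; the function names are illustrative and this is not the evaluation code used for Figure 6.

```python
def average_precision(ranked_relevance):
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, a ranking with relevant items at positions 1 and 3 has average precision (1/1 + 2/3) / 2, so ranking relevant items earlier raises the score.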

The full description of Massim and MassIR and the complete evaluation results are
provided in the attached paper [5].

This  preliminary  result  provides  the  foundation  for  further  investigation  in  a  new  project 
to  establish  the  theory  and  evidence  that  non-metric  similarity  measures  can  work  better 
than  commonly  used  similarity  measures  based  on  a  metric  in  data  mining  tasks. 

4  Final  Remark 

The successful creation of the second generation of mass estimation using the nearest
neighbour approach represents a milestone that will have significant influence on the
future development of mass estimation. While developing the new Bayesian classifier using
the first generation of mass estimation reported here, we also spent a substantial amount
of time generalising the approach to incorporate the second generation of mass
estimation. This has yielded ENNBayes, reported in paper [6].

The work on the first mass-based similarity measure, though envisaged and described
briefly in the initial research proposal of this project, was not one of the original aims
of this project. However, I am glad that the progress in this project has enabled the
research team to produce the preliminary findings on this new similarity measure. I am of
the opinion that the finding represents the beginning of the next phase of mass estimation
research: the research team has shown that the new similarity measure is a generalisation
of mass estimation. This is an exciting development.

The work on the generalisations of both the mass-based Bayesian classifier approach and
mass estimation mentioned above unfortunately took time away from the research initially
planned for data streams and high dimensional issues. However, I believe that the results
the research team has produced form a better foundation for researching these and other
issues in the future.

[4] Jonathan R. Wells, Kai Ming Ting and Takashi Washio. LiNearN: A New Approach
to Nearest Neighbour Density Estimator. Submitted to Pattern Recognition.

[5] Kai Ming Ting, Thilak Laksiri Fernando and Geoffrey I. Webb. Mass-based Similarity
Measure: An Effective Alternative to Distance-based Similarity Measures. Submitted
to 2013 IEEE International Conference on Data Mining.

[6] Sunil Aryal and Kai Ming Ting. An ensemble approach to estimate multi-dimensional
likelihood in Bayesian classifier learning. Submitted to Machine Learning.

Collaboration with two colleagues, Takashi Washio (Osaka University) and Geoff Webb
(Monash University), has resulted in three papers on mass estimation [2, 4, 5].

As  a  result  of  the  work  on  mass  estimation,  the  following  collaborations  have  produced 
one  industry  grant  and  two  publications: 

A.  Toyota  InfoTechnology  Centre  (Japan)  has  provided  a  grant  in  2012  to  work  on  an 
application  to  a  vehicle  warning  system  and  produced  the  following  publication: 

[7]  Jonathan  R.  Wells,  Kai  Ming  Ting  and  Naiwala  P.  Chandrasiri  (2012).  A  non-time 
series  approach  to  vehicle  related  time  series  problems.  In  Proceedings  of  the  Tenth 
Australasian  Data  Mining  Conference.  Volume  134  in  the  Conferences  in  Research  and 
Practice  in  Information  Technology  Series.  Australian  Computer  Society. 

B.  Research  collaboration  with  Shandong  University  (China)  has  produced  paper  [1]  and 
the  following  publication: 

[8]  Guang-Tong  Zhou,  Kai  Ming  Ting,  Fei  Tony  Liu  and  Yilong  Yin  (2012).  Relevance 
Feature Mapping for Content-based Multimedia Information Retrieval. Pattern
Recognition. Vol. 45, pp. 1707-1720.

Note


Paper [1] is an extension of the work previously published in KDD-2010. It includes
an extension from single-dimensional mass estimation to multi-dimensional mass
estimation; their empirical evaluations in three tasks (regression, information retrieval
and anomaly detection); and an in-depth analysis and comparison with a closely related
method called data depth.

Paper [2] extends the work previously published in IEEE ICDM-2011 to include (i) a
contrast between point-based definitions and set-based definitions of the proposed
mass-based density estimator; and (ii) an application of the mass-based density estimator
to Bayesian classifiers and an in-depth comparison with existing state-of-the-art Bayesian
classifiers.

Paper [6] extends the work previously published in PAKDD-2013 [3] to present the first
generic approach to estimate multi-dimensional likelihood $p(\mathbf{x}|y)$ directly by
aggregating $p_i(\mathbf{x}|y)$ estimated from an ensemble of estimators, where each
estimator is constructed from a small fixed-size random sub-sample of data
$\mathcal{D}_i \subset D$ ($i = 1, 2, \ldots$). This is a generic approach because
$p_i(\mathbf{x}|y)$ can be estimated using different data modelling methods. DEMass-Bayes
[2] and MassBayes [3] are two realisations of the proposed generic approach. In this
paper, we introduce an additional realisation of the proposed generic approach called
ENNBayes along with MassBayes. ENNBayes estimates $p_i(\mathbf{x}|y)$ from
$\mathcal{D}_i$ using a nearest neighbour density estimator.


Software  Downloads 

The source codes of multi-dimensional mass estimation, DEMass-DBSCAN, DEMass-Bayes
and MassBayes, algorithms proposed in papers [1, 2, 3], are made available at
http://sourceforge.net/projects/mass-estimation/


Attachments 

Papers [1, 2, 3, 4, 5, 6] are attached.

Abstract  This paper introduces mass estimation, a base modelling mechanism that can
be  employed  to  solve  various  tasks  in  machine  learning.  We  present  the  theoretical  basis  of 
mass  and  efficient  methods  to  estimate  mass.  We  show  that  mass  estimation  solves  problems 
effectively  in  tasks  such  as  information  retrieval,  regression  and  anomaly  detection.  The 
models,  which  use  mass  in  these  three  tasks,  perform  at  least  as  well  as  and  often  better  than 
eight  state-of-the-art  methods  in  terms  of  task-specific  performance  measures.  In  addition, 
mass estimation has constant time and space complexities.

1  Introduction

‘Estimation  of  densities  is  a  universal  problem  of  statistics  (knowing  the  densities  one 
can  solve  various  problems).’  — Vapnik  (2000). 

Density estimation has been the base modelling mechanism used in many techniques
designed for tasks such as classification, clustering, anomaly detection and information
retrieval. For example in classification, density estimation is employed to estimate the
class-conditional density function (or likelihood function) $p(x|j)$ or posterior probability
$p(j|x)$, the principal function underlying many classification methods; e.g., mixture
models, Bayesian networks, Naive Bayes. Examples of density estimation include kernel
density estimation, k-nearest neighbours density estimation, maximum likelihood procedures
and Bayesian methods.

Ranking data points in a given data set in order to differentiate core points from fringe
points in a data cloud is fundamental in many tasks, including anomaly detection and
information retrieval. Anomaly detection aims to rank anomalous points higher than normal
points; information retrieval aims to rank points similar to a query higher than
dissimilar points. Many existing methods (e.g., Bay and Schwabacher 2003; Breunig et al. 2000;
Zhang and Zhang 2006) have employed density to provide the ranking; but density
estimation is not designed to provide a ranking.

We  show  in  this  paper  that  a  new  base  modelling  mechanism  called  mass  estimation 
possesses  different  properties  from  those  offered  by  density  estimation: 

•  A  mass  distribution  stipulates  an  ordering  from  core  points  to  fringe  points  in  a  data 
cloud.  In  addition,  this  ordering  accentuates  the  fringe  points  with  a  concave  function 
derived  from  data,  resulting  in  fringe  points  having  markedly  smaller  mass  than  points 
close  to  the  core  points. 

•  Mass estimation is more efficient than density estimation because mass is computed by
simple counting and it requires only a small sample through an ensemble approach.
Density estimation (often used to estimate $p(x|j)$ and $p(j|x)$) requires a large sample size
in order to have a good estimation and is computationally expensive in terms of time and
space complexities (Duda et al. 2001).

Mass estimation has two advantages in relation to efficacy and efficiency. First, the
concavity property mentioned above ensures that fringe points are ‘stretched’ to be farther from
the  core  points  in  a  mass  space — making  it  easier  to  separate  fringe  points  from  those  points 
close  to  core  points.  This  property  in  mass  space  can  then  be  exploited  by  a  machine  learning 
algorithm  to  achieve  a  better  result  for  the  intended  task  than  applying  the  same  algorithm 
in  the  original  space  without  this  property.  We  show  the  efficacy  of  mass  in  improving  the 
task-specific  performance  of  four  existing  state-of-the-art  algorithms  in  information  retrieval 
and  regression  tasks.  The  significant  improvements  are  achieved  through  a  simple  mapping 
from  the  original  space  to  a  mass  space  using  the  mass  estimation  mechanism  introduced  in 
this  paper. 

Second,  mass  estimation  offers  to  solve  a  ranking  problem  more  efficiently  using  the 
ordering  derived  from  data  directly — without  expensive  distance  (or  related)  calculations. 
An  example  of  inefficient  application  is  in  anomaly  detection  tasks  where  many  methods 
have  employed  distance  or  density  to  provide  the  required  ranking.  An  existing  state-of-the- 
art  density-based  anomaly  detector  LOF  (Breunig  et  al.  2000)  (which  has  quadratic  time 
complexity)  completes  a  job  involving  half  a  million  data  points  in  more  than  five  hours; 
yet  the  mass-based  anomaly  detector  we  have  introduced  here  completes  it  in  less  than  20 
seconds!  Section  6.3  provides  the  details  of  this  example. 

The rest of the paper is organised as follows. Section 2 introduces mass and mass
estimation, together with their theoretical properties. We also describe methods for
one-dimensional mass estimation. We extend one-dimensional mass estimation to
multi-dimensional mass estimation in Sect. 3. We provide an implementation of multi-dimensional
mass estimation in Sect. 4. Section 5 describes a mass-based formalism which serves as a
basis of applying mass to different data mining tasks. We realise the formalism in three
different tasks: information retrieval, regression and anomaly detection, and report the
empirical evaluation results in Sect. 6. The relations to kernel density estimation, data depth and
other related work are described in Sects. 7, 8 and 9, respectively. We provide conclusions
and suggest future work in the last section.


2  Mass  and  mass  estimation 

Data  mass  or  mass,  in  its  simplest  form,  is  defined  as  the  number  of  points  in  a  region.  Any 
two  groups  of  data  in  the  same  domain  have  the  same  mass  if  they  have  the  same  number 
of  points,  regardless  of  the  characteristics  of  the  regions  they  occupy  (e.g.,  density,  shape 
or  volume).  Mass  in  a  given  region  is  thus  defined  by  a  rectangular  function  which  has  the 
same  value  for  the  entire  region  in  which  the  mass  is  measured. 

To  estimate  the  mass  for  a  point  and  thus  the  mass  distribution  of  a  given  data  set,  a  more 
sophisticated  form  is  required.  The  intuition  is  based  on  the  simplest  form  described  above, 
but  multiple  (overlapping)  regions  covering  a  point  are  generated.  The  mass  for  the  point 
is  then  derived  from  an  average  of  masses  from  all  regions  covering  the  point.  We  show 
two  ways  to  define  these  regions.  The  first  is  to  generate  all  possible  regions  through  binary 
splits  from  the  given  data  points;  and  the  second  is  to  generate  random  axis-parallel  regions 
within  the  confine  covered  by  a  data  sample.  The  first  is  described  in  this  section  and  the 
second  is  described  in  Sect.  3. 

Each  region  can  be  defined  in  multiple  levels  where  a  higher  level  region  covering  a 
point  has  a  smaller  volume  than  that  of  a  lower  level  region  covering  the  same  point.  We 
show  that  the  mass  distribution  has  special  properties:  (i)  the  mass  distribution  defined  by 
level-1 regions is a concave function which has the maximum mass at the centre of the data
cloud,  irrespective  of  its  density  distribution,  including  uniform  and  U-shape  distributions; 
and  (ii)  higher  level  regions  are  required  to  model  multi-modal  mass  distributions. 

Note  that  mass  is  not  a  probability  mass  function,  and  it  does  not  provide  a  probability, 
as  the  probability  density  function  does  through  integration. 

In  Sect.  2.1,  we  show  (i)  how  to  estimate  a  mass  distribution  for  a  given  data  set  through 
binary  splits  and  (ii)  the  theoretical  properties  of  mass  estimation.  Section  2.2  describes  an 
approximation  to  the  theoretical  mass  estimation  which  works  more  efficiently  in  practice. 
The symbols and notations used are provided in Table 1.


Table 1  Symbols and notations

Symbol                   Description
$\mathcal{R}^u$          A real domain of $u$ dimensions
$x$                      A one-dimensional instance in $\mathcal{R}$
$\mathbf{x}$             An instance in $\mathcal{R}^u$
$D$                      A data set of $\mathbf{x}$, where $|D| = n$
$\mathcal{D}$            A subset of $D$, where $|\mathcal{D}| = \psi$
$\mathbf{z}$             An instance in $\mathcal{R}^t$
$D'$                     A data set of $\mathbf{z}$
$c$                      The ensemble size used to estimate mass
$h$                      Level of mass distribution
$t$                      Number of mass distributions in $\mathbf{mass}(\cdot)$
$m_i(\cdot)$             Mass base function defined using binary split $s_i$
$mass(\cdot)$            Mass function which returns a real value in one-dimensional mass space
$\mathbf{mass}(\cdot)$   Mass function which returns a vector of $t$ values in $t$-dimensional mass space

2.1  Mass distribution estimation

In  this  section,  we  first  show  in  Sect.  2.1.1  a  mass  distribution  estimation  that  uses  binary 
splits  in  the  one-dimensional  setting,  where  each  binary  split  separates  the  one-dimensional 
space  into  two  non-empty  regions.  In  Sect.  2.1.2,  we  then  generalise  the  treatment  using 
multiple  levels  of  binary  splits. 

2.1.1  Mass  distribution  estimation  using  binary  splits 

Here,  we  employ  a  binary  split  to  divide  the  data  set  into  two  separate  regions  and  compute 
the  mass  in  each  region.  The  mass  distribution  at  point  x  is  estimated  to  be  the  sum  of  all 
‘weighted’ masses from regions occupied by $x$, as a result of $n - 1$ binary splits for a data
set of size $n$.

Let $x_1 < x_2 < \cdots < x_{n-1} < x_n$ on the real line (see footnote 1), $x_i \in \mathcal{R}$ and $n > 1$. Let $s_i$ be the binary
split between $x_i$ and $x_{i+1}$, yielding two non-empty regions having two masses $m_i^L$ and $m_i^R$.

Definition 1  Mass base function: $m_i(x)$, as a result of $s_i$, is defined as

$$m_i(x) = \begin{cases} m_i^L & \text{if } x \text{ is on the left of } s_i \\ m_i^R & \text{if } x \text{ is on the right of } s_i \end{cases}$$

Note that $m_i^L = n - m_i^R = i$.

Definition 2  Mass distribution: $mass(x_a)$ for a point $x_a \in \{x_1, x_2, \ldots, x_{n-1}, x_n\}$ is defined
as a summation of a series of mass base functions $m_i(x)$ weighted by $p(s_i)$ over $n - 1$ splits
as follows, where $p(s_i)$ is the probability of selecting $s_i$.

$$\begin{aligned}
mass(x_a) &= \sum_{i=1}^{n-1} m_i(x_a)\,p(s_i) \\
&= \sum_{i=a}^{n-1} m_i^L\,p(s_i) + \sum_{j=1}^{a-1} m_j^R\,p(s_j) \\
&= \sum_{i=a}^{n-1} i\,p(s_i) + \sum_{j=1}^{a-1} (n - j)\,p(s_j)
\end{aligned} \tag{1}$$

Note that we have defined $\sum_{i=q}^{r} f(i) = 0$, when $r < q$ for any function $f$.

Example  For an example of five points $x_1 < x_2 < x_3 < x_4 < x_5$, Fig. 1 shows the resultant
$m_i(x)$ due to each of the four binary splits $s_1, s_2, s_3, s_4$; and their associated masses over four
splits are given below:

$$\begin{aligned}
mass(x_1) &= 1p(s_1) + 2p(s_2) + 3p(s_3) + 4p(s_4) \\
mass(x_2) &= 4p(s_1) + 2p(s_2) + 3p(s_3) + 4p(s_4) \\
mass(x_3) &= 4p(s_1) + 3p(s_2) + 3p(s_3) + 4p(s_4) \\
mass(x_4) &= 4p(s_1) + 3p(s_2) + 2p(s_3) + 4p(s_4) \\
mass(x_5) &= 4p(s_1) + 3p(s_2) + 2p(s_3) + 1p(s_4)
\end{aligned}$$


1 In data having a pocket of points of the same value, an arbitrary order can be ‘forced’ by adding increasing
multiples of an insignificantly small value $\epsilon$ to each subsequent point of the pocket, without changing the
general distribution.

[Fig. 1 plots omitted. Panels: (a) $m_i(x)$ due to $s_1$; (b) $m_i(x)$ due to $s_2$; (c) $m_i(x)$ due to $s_3$; (d) $m_i(x)$ due to $s_4$.]

Fig. 1  Examples of mass base function due to each of the four binary splits: $s_1$, $s_2$, $s_3$, $s_4$


For a given data set, $p(s_i)$ can be estimated on the real line as $p(s_i) = (x_{i+1} - x_i)/(x_n - x_1) > 0$,
as a result of a random selection of splits based on a uniform distribution (see footnote 2).

For a point $x \notin \{x_1, x_2, \ldots, x_{n-1}, x_n\}$, $mass(x)$ is defined as an interpolation between two
masses of adjacent points $x_i$ and $x_{i+1}$, where $x_i < x < x_{i+1}$.
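
As an illustration, the level-1 mass of Eq. (1) can be computed directly for a small sorted sample. The following is a minimal sketch rather than the paper's implementation: the function name `level1_mass` is hypothetical, the points are assumed to be distinct, and $p(s_i)$ is taken to be the normalised gap between adjacent points as stated above. Exact fractions are used so that the values can be compared by hand with the five-point example.

```python
from fractions import Fraction


def level1_mass(points):
    """Level-1 mass of every point, following Eq. (1) (illustrative sketch)."""
    xs = sorted(points)
    n = len(xs)
    span = Fraction(xs[-1] - xs[0])
    # p[i - 1] holds p(s_i), the probability of the split between x_i and x_{i+1}
    p = [Fraction(xs[i] - xs[i - 1]) / span for i in range(1, n)]
    masses = []
    for a in range(1, n + 1):  # a is the 1-based index of the point x_a
        # x_a lies on the left of s_i for i >= a, where the left mass is i
        left = sum(i * p[i - 1] for i in range(a, n))
        # x_a lies on the right of s_j for j < a, where the right mass is n - j
        right = sum((n - j) * p[j - 1] for j in range(1, a))
        masses.append(left + right)
    return masses


if __name__ == "__main__":
    # Five equally spaced points, so every p(s_i) equals 1/4
    for x, m in zip([0, 1, 2, 3, 4], level1_mass([0, 1, 2, 3, 4])):
        print(x, m)
```

With five equally spaced points each $p(s_i)$ is 1/4, so the middle point receives the largest mass and the two end points the smallest, in line with the ordering in the five-point example.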


Theorem 1  $mass(x_a)$ is maximised at $a = n/2$ for any density distribution of $\{x_1, \ldots, x_n\}$;
and the points $x_a$, where $x_1 < x_2 < \cdots < x_{n-1} < x_n$ on the real line, can be ordered based
on mass as follows.

$$\begin{aligned}
mass(x_a) &< mass(x_{a+1}), \quad a < n/2 \\
mass(x_a) &> mass(x_{a+1}), \quad a > n/2
\end{aligned}$$


Proof  The masses of two consecutive points $x_a$ and $x_{a+1}$ differ in only
one term, i.e., the mass associated with $p(s_a)$; and $\forall i \neq a$, the terms for $p(s_i)$ have the same
mass.

$$\begin{aligned}
mass(x_a) - mass(x_{a+1}) &= \sum_{i=a}^{n-1} i\,p(s_i) + \sum_{j=1}^{a-1} (n - j)\,p(s_j) - \sum_{i=a+1}^{n-1} i\,p(s_i) - \sum_{j=1}^{a} (n - j)\,p(s_j) \\
&= a\,p(s_a) - (n - a)\,p(s_a) \\
&= (2a - n)\,p(s_a)
\end{aligned} \tag{2}$$

Thus,

$$mass(x_a) - mass(x_{a+1}) \text{ is } \begin{cases} \text{negative} & \text{if } a < n/2 \\ 0 & \text{if } a = n/2 \\ \text{positive} & \text{if } a > n/2 \end{cases}$$

The point $x_{n/2}$ can be regarded as the median. Note that the number of points with the
maximum mass depends on whether $n$ is odd or even: when $n$ is an odd integer, only one
point has the maximum mass at $x_{median}$, where $median = \lceil n/2 \rceil$; when $n$ is an even integer,
two points have the maximum mass at $a = n/2$ and $a = 1 + n/2$. □

2 The estimated $mass(x)$ values can be calibrated to a finite data range $\Delta$ by multiplying a factor $(x_n - x_1)/\Delta$.


Theorem 2  $mass(x_a)$ is a concave function defined w.r.t. $\{x_1, x_2, \ldots, x_n\}$, when
$p(s_i) = (x_{i+1} - x_i)/(x_n - x_1)$ for $n > 2$.


Proof  We only need to show that the gradient at $x_a$ is non-increasing, i.e., $g(x_a) > g(x_{a+1})$
for each $a$.

Let $g(x_a)$ be the gradient between $x_a$ and $x_{a+1}$. From (2), and substituting
$p(s_a) = (x_{a+1} - x_a)/(x_n - x_1)$:

$$g(x_a) = \frac{mass(x_{a+1}) - mass(x_a)}{x_{a+1} - x_a} = \frac{n - 2a}{x_n - x_1}$$

The result follows: $g(x_a) > g(x_{a+1})$ for $a \in \{1, 2, \ldots, n - 1\}$. □
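
The two proofs can be checked numerically on a deliberately skewed sample. The sketch below is an illustration under stated assumptions, not code from the paper: the helper names are hypothetical, the points are assumed sorted and distinct, and the masses are built with the recurrence $mass(x_{a+1}) = mass(x_a) + (n - 2a)\,p(s_a)$ that follows from (2).

```python
from fractions import Fraction


def level1_mass_incremental(xs):
    """Level-1 mass for sorted, distinct xs via the recurrence implied by Eq. (2)."""
    n = len(xs)
    span = Fraction(xs[-1] - xs[0])
    # p[i] holds p(s_{i+1}), the normalised gap between xs[i] and xs[i + 1]
    p = [Fraction(xs[i + 1] - xs[i]) / span for i in range(n - 1)]
    # mass(x_1) is the sum over all splits of i * p(s_i)
    masses = [sum((i + 1) * p[i] for i in range(n - 1))]
    for a in range(1, n):  # a is the 1-based index of the current point
        masses.append(masses[-1] + (n - 2 * a) * p[a - 1])
    return masses


def gradients(xs):
    """Gradient g(x_a) between each pair of consecutive points."""
    m = level1_mass_incremental(xs)
    return [(m[a + 1] - m[a]) / Fraction(xs[a + 1] - xs[a]) for a in range(len(xs) - 1)]


if __name__ == "__main__":
    xs = [0, 1, 3, 7, 15, 31]  # heavily skewed spacing, n = 6
    print(level1_mass_incremental(xs))
    print(gradients(xs))  # each gradient equals (n - 2a) / (x_n - x_1)
```

Even though the spacing is far from uniform, the gradients decrease in equal steps and the largest mass falls on the two middle points ($a = n/2$ and $a = 1 + n/2$ for even $n$).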


Corollary 1  A mass distribution estimated using binary splits stipulates an ordering, based
on mass, of the points in a data cloud from $x_{n/2}$ (with the maximum mass) to the fringe
points (with the minimum mass at either side of $x_{n/2}$), irrespective of the density distribution
including uniform density distribution.


Corollary 2  The concavity of mass distribution stipulates that fringe points have markedly
smaller mass than points close to $x_{n/2}$.

The  implication  from  Corollary  2  is  that  fringe  points  are  ‘stretched’  to  be  farther  away 
from  the  median  in  a  mass  space  than  in  the  original  space — making  it  easier  to  separate 
fringe  points  from  those  points  close  to  the  median.  The  mass  space  is  mapped  from  the 
original  space  through  mass(x).  This  property  in  mass  space  can  then  be  exploited  by  a 
machine  learning  algorithm  to  achieve  a  better  result  for  the  intended  task  than  applying  the 
same  algorithm  in  the  original  space  without  this  property.  We  will  show  that  this  simple 
mapping  significantly  improves  the  performance  of  four  existing  algorithms  in  information 
retrieval  and  regression  tasks  in  Sects.  6.1  and  6.2. 

Equation (1) is sufficient to provide a mass distribution corresponding to a unimodal
density function or a uniform density function. To better estimate multi-modal mass
distributions, multiple levels of binary splits need to be carried out. This is provided in the
following.

Fig. 2  Two examples of $mass_i^L(x, h = 1)$ and $mass_i^R(x, h = 1)$ due to $s_{i=7}$ and $s_{i=11}$ in the process to get
$mass(x, h = 2)$ from a data set of 20 points with uniform density distribution. The resultant $mass(x, h = 2)$
is shown in Fig. 3(a)


2.1.2  Level-h  mass  distribution  estimation 

If we treat the mass estimation defined in the last subsection as level-1 estimation, then
level-$h$ estimation can be viewed as localised versions of the basic level-1 estimation.


Definition 3  The level-$h$ mass distribution for a point $x_a \in \{x_1, \ldots, x_n\}$, where $h < n$, is
expressed as

$$\begin{aligned}
mass(x_a, h) &= \sum_{i=1}^{n-1} mass_i(x_a, h-1)\,p(s_i) \\
&= \sum_{i=a}^{n-1} mass_i^L(x_a, h-1)\,p(s_i) + \sum_{j=1}^{a-1} mass_j^R(x_a, h-1)\,p(s_j)
\end{aligned} \tag{3}$$

Here a high level mass distribution is computed recursively by using the mass
distributions obtained at lower levels. A binary split $s_i$ in a level-$h$ ($> 1$) mass distribution produces
two level-$(h-1)$ mass distributions: (a) $mass_i^L(x, h-1)$, the mass distribution on the left of
split $s_i$ which is defined using $\{x_1, \ldots, x_i\}$; and (b) $mass_i^R(x, h-1)$, the mass distribution
on the right which is defined using $\{x_{i+1}, \ldots, x_n\}$. Equation (1) is the mass distribution at
level-1.

Figure 2 shows two (out of 19) splits required to compute level-2 mass estimation,
$mass(x, h = 2)$, from a data set of 20 points. Each split produces two level-1 mass
estimations: $mass_i^L(x, h = 1)$ and $mass_i^R(x, h = 1)$. Note that level-1 mass distribution is concave,
as proven in Theorem 2. This example shows the results of two splits $s_{i=7}$ and $s_{i=11}$, where
each level-1 mass distribution is concave.

Using the same analysis as in the proof for Theorem 1, the above equation can be
re-expressed as:

$$mass(x_{a+1}, h) = mass(x_a, h) + \begin{cases} \left[ mass_a^R(x_a, h-1) - mass_a^L(x_a, h-1) \right] p(s_a), & h > 1 \\ (n - 2a)\,p(s_a), & h = 1 \end{cases} \tag{4}$$


As a result, only the mass for the first point $x_1$ needs to be computed using (3). Note that
it is more efficient to compute the mass distribution from the above equation, which has time
complexity $O(n^{h+1})$; the computation using (3) has complexity $O(n^{h+2})$.
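
The recursion in (3) can be sketched as follows. This is a direct, unoptimised illustration and not the authors' implementation. It rests on assumptions that are not spelled out in the text: the level-0 mass of a region is taken to be its point count (which makes $h = 1$ reduce to Eq. (1)), $p(s_i)$ is re-normalised within each sub-region, and a region holding a single point (where no split is possible) falls back to its point count. The function name `level_h_mass` is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def level_h_mass(points, h):
    """Level-h mass of every point by direct recursion on Eq. (3) (sketch)."""
    xs = tuple(sorted(points))

    @lru_cache(maxsize=None)
    def mass(lo, hi, a, level):
        """Mass of xs[a] w.r.t. the region xs[lo:hi] at the given level."""
        size = hi - lo
        if level == 0 or size == 1:
            # Assumption of this sketch: fall back to the region's point count
            return Fraction(size)
        span = Fraction(xs[hi - 1] - xs[lo])
        total = Fraction(0)
        for i in range(lo, hi - 1):  # split between xs[i] and xs[i + 1]
            p = Fraction(xs[i + 1] - xs[i]) / span
            if a <= i:
                total += mass(lo, i + 1, a, level - 1) * p  # xs[a] in left region
            else:
                total += mass(i + 1, hi, a, level - 1) * p  # xs[a] in right region
        return total

    return [mass(0, len(xs), a, h) for a in range(len(xs))]


if __name__ == "__main__":
    print(level_h_mass([0, 1, 2, 3, 4], 1))     # matches Eq. (1)
    print(level_h_mass([0, 1, 2, 3, 4, 5], 2))  # level-2 values
```

The memoised recursion is only meant to make the definition concrete; it does not use the incremental form discussed above and is not tuned to the stated complexities.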


[Fig. 3 plots omitted. Recoverable panel titles: (a) uniform; (b) trimodal.]

Fig. 3  Examples of level-$h$ mass distribution for $h = 1, 2, 3$ and density distribution from kernel density
estimation: Gaussian kernel with bandwidth = 0.1 (for the first three figures) and 0.01 (for the last figure in
order to show the density spike). The data sets have 20 points each for the first three figures, and the last one
has 50 points


Definition 4  A level-$h$ mass distribution stipulates an ordering of the points in a data cloud
from $\sigma$-core points to the fringe points. Let the $\sigma$-neighbourhood of a point $x$ be defined as
$N_\sigma(x) = \{y \in D \mid dist(x, y) < \sigma\}$ for some distance function $dist(\cdot, \cdot)$. Each $\sigma$-core point $x^*$
in a data cloud has the highest mass value $\forall x \in N_\sigma(x^*)$. A small $\sigma$ defines local core point(s);
and a large $\sigma$, which covers the entire value range for $x$, defines global core point(s).
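
Definition 4 can be read as a simple filter over precomputed mass values. The sketch below is illustrative only: the function name `sigma_core_points` is hypothetical, the distance is assumed to be the absolute difference on the real line, and the mass values in the usage example are made up for demonstration rather than taken from any experiment.

```python
def sigma_core_points(points, masses, sigma):
    """Return the points whose mass is the highest within their sigma-neighbourhood."""
    cores = []
    for x, m in zip(points, masses):
        # Masses of all points y with dist(x, y) < sigma (this includes x itself)
        neighbourhood = [mj for xj, mj in zip(points, masses) if abs(x - xj) < sigma]
        if m >= max(neighbourhood):
            cores.append(x)
    return cores


if __name__ == "__main__":
    points = [0, 1, 2, 3, 4, 5, 6]
    masses = [1, 3, 1, 0, 1, 4, 1]  # made-up bimodal mass values
    print(sigma_core_points(points, masses, 1.5))  # local core points
    print(sigma_core_points(points, masses, 10))   # global core point
```

A small neighbourhood radius picks out one core point per mode, while a radius that covers the whole value range leaves only the global core point.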

Examples of level-$h$ mass estimation in comparison with kernel density estimation are
provided in Fig. 3. Note that $h = 1$ mass estimation looks at the data as a group, and it
produces a concave function. As a result, an $h = 1$ mass estimation always has its global
core point(s) at the median, regardless of the underlying density distribution; see the four
examples of $h = 1$ mass estimation in Fig. 3.

For $h > 1$ mass distribution, though there is no longer a guarantee of a concave function
as a whole, our simulation shows that each cluster within the data cloud (if any exist)
exhibits a concave function and it becomes more distinct (as a concave function) as $h$
increases. An example is shown in Fig. 3(b), which has a trimodal density distribution.
Notice that the $h > 1$ mass distributions have three $\sigma$-core points for some $\sigma$, e.g., 0.2. Other
examples are shown in Figs. 3(c) and 3(d).

Traditionally, one can estimate the core-ness or the fringe-ness of non-uniformly
distributed data to some degree by using density or distance (but not in uniform density
distribution). Mass allows one to do that in any distribution without density or distance calculation,
the key computational expense in all methods that employ them. For example in Fig. 3(c),
which has a skew density distribution, the distinction between near fringe points and far
fringe points is less obvious using density, unless distances are computed to reveal the
difference. In contrast, mass distribution depicts the relative distance from $x_{median}$ using the
fringe points’ mass values, without further calculation.

Figure  3(d)  shows  an  example  where  there  are  clustered  anomalies  which  are  denser  than 
the  normal  points  (shown  in  the  bigger  cluster  on  the  left  of  the  figure).  Anomaly  detection 
based  on  density  will  identify  all  these  clustered  anomalies  as  more  ‘normal’  than  the  normal 
points  because  anomalies  are  defined  as  points  having  low  density.  In  sharp  contrast,  h  =  1 
mass  estimation  will  correctly  rank  them  as  anomalies  which  have  the  third  lowest  mass 
values.  These  points  are  interpreted  as  points  at  the  fringe  of  the  data  cloud  of  normal  points 
which  have  higher  mass  values. 

This section has described properties of mass distribution from a theoretical perspective.
Though it is possible to estimate mass distribution using (1) and (3), they are limited by their
high computational cost. We suggest a practical mass estimation method in the next
subsection. We use the terms ‘mass estimation’ and ‘mass distribution estimation’ interchangeably
hereafter.


2.2  Practical one-dimensional level-$h$ mass estimation


Here  we  devise  an  approximation  to  (3)  using  random  subsamples  from  a  given  data  set. 

Definition 5  $mass(x, h|\mathcal{D})$ is the approximate mass distribution for a point $x \in \mathcal{R}$, defined
w.r.t. $\mathcal{D} = \{x_1, \ldots, x_\psi\}$, where $\mathcal{D}$ is a random subset of the given data set $D$, and $\psi \ll |D|$,
$h < \psi$.


Here we employ a rectangular function to define mass for a region encompassing each
point $x \in \mathcal{D}$. $mass(x, h|\mathcal{D})$ is implemented using a lookup table where a region for each point
$x_i \in \mathcal{D}$ covers a range $(x_{i-1} + x_i)/2 < x < (x_{i+1} + x_i)/2$ and has the same $mass(x_i, h|\mathcal{D})$
value for the entire region. The range for each of the two end-points is set to have equal
length on either side of the point. An example is provided in Fig. 4(a).

In addition, a number of mass distributions need to be constructed from different
samples in order to have a good approximation, that is,

$$mass(x, h) \approx \frac{1}{c} \sum_{k=1}^{c} mass(x, h|\mathcal{D}_k) \tag{5}$$


The computation of $mass(x, h)$ using the given data set $D$ costs $O(|D|^{h+1})$ in terms of time
complexity; whereas $mass(x, h|\mathcal{D})$ costs $O(\psi^{h+1})$.
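
The lookup-table approximation and the ensemble average in (5) can be sketched for $h = 1$ as below. This is a minimal illustration under stated assumptions, not the paper's code: all function names are hypothetical, region boundaries are placed at midpoints between adjacent sample points with the left boundary treated as inclusive, and, as a simplification, the two end regions are extended to infinity instead of being given a finite length. Sampled values are assumed to be distinct.

```python
import random
from bisect import bisect_right


def level1_mass(xs):
    """Level-1 mass of sorted, distinct xs via mass(x_{a+1}) = mass(x_a) + (n - 2a) p(s_a)."""
    n = len(xs)
    span = xs[-1] - xs[0]
    p = [(xs[i + 1] - xs[i]) / span for i in range(n - 1)]
    masses = [sum((i + 1) * p[i] for i in range(n - 1))]
    for a in range(1, n):
        masses.append(masses[-1] + (n - 2 * a) * p[a - 1])
    return masses


def build_table(sample):
    """Lookup table: midpoints between adjacent sample points, plus one mass per region."""
    xs = sorted(sample)
    cuts = [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
    return cuts, level1_mass(xs)


def lookup(table, x):
    """Mass of the rectangular region that x falls into."""
    cuts, masses = table
    return masses[bisect_right(cuts, x)]


def ensemble_mass(data, x, psi=8, c=100, seed=0):
    """Average of c lookup tables, each built from a random subsample of size psi (Eq. 5)."""
    rng = random.Random(seed)
    tables = [build_table(rng.sample(data, psi)) for _ in range(c)]
    return sum(lookup(t, x) for t in tables) / c


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0, 1) for _ in range(2000)]
    print(ensemble_mass(data, 0.0), ensemble_mass(data, 3.0))
```

Each table costs only a sort of $\psi$ values and a single pass over them, and a query is a binary search per table, which is the practical appeal of working with small subsamples.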

Only relative, not absolute, mass is required to provide an ordering between instances.
For $h = 1$, the relative mass is w.r.t. the median, and the median is a robust estimator
(Aloupis 2006); that is why small subsamples produce a good estimator for ordering.
While this reason cannot be applied to $h > 1$ (and multi-dimensional mass estimation to
be discussed in the next section) because the notion of median is undefined, our empirical
results in Sect. 6 show that all these mass estimations using small subsamples produce good
results.

In order to show the relative performance of $mass(x, 1)$ and $mass(x, 1|\mathcal{D})$, we compare
the ordering results based on mass values in two separate data sets: the one-dimensional
Gaussian density distribution and the COREL data set; each of the data sets has 10000
data points. Figure 4(b) shows the correlation (in terms of Spearman’s rank correlation
coefficient) between the orderings provided by $mass(x, 1)$ using the entire data set and
$mass(x, 1|\mathcal{D})$ using $\psi = 8$. They achieve very high correlations when $c > 100$ for both
data sets.

Fig. 4  (a) An example of practical mass distribution $mass(x, h|\mathcal{D})$ for 5 points, assuming a rectangular
function for each point. (b) Correlation between the orderings provided by $mass(x, 1)$ and $mass(x, 1|\mathcal{D})$
for two data sets: one-dimensional Gaussian density distribution and the COREL data set used in Sect. 6.1
(whose result is averaged over 67 dimensions)

The  ability  to  use  a  small  sample,  rather  than  a  large  sample,  is  a  key  characteristic  of 
mass  estimation. 


3  Multi-dimensional  mass  estimation 

Ting and Wells (2010) describe a way to generalise the one-dimensional mass estimation we
have described in the last section. We reiterate the approach in this section, but the
implementation we employed (to be described in Sect. 4) differs. Section 9 provides the details of
these differences.

The approach proposed by Ting and Wells (2010) eliminates the need to compute the
probability of a binary split, p(s_i); and it gives rise to randomised versions of (1), (3) and (5).

The idea is to generate multiple random regions which cover a point, and then the mass
for that point is estimated by averaging all masses from all those regions. We show that
random regions can be generated using axis-parallel splits called half-space splits. Each
half-space split is performed on a randomly selected attribute in a multi-dimensional feature
space. For an h-level split, each half-space split is carried out h times recursively along
every path in a tree structure. Each h-level (axis-parallel) split generates 2^h non-overlapping
regions. Multiple h-level splits are used to estimate mass for each point in the feature space.

The multi-dimensional mass estimation requires two functions. First, it needs a function
that generates random regions covering each point in the feature space. This function is a
generalisation of the binary split into half-space splits or 2^h-region splits when h levels of
half-space splits are used. Second, a generalised version of the mass base function is used to
define mass in a region. The formal definition follows.

Let x be an instance in ℛ^d. Let T^h(x) be the one of the 2^h regions into which x falls;
T^h(·) is generated from the given data set D, and T^h(·|𝒟) is generated from 𝒟 ⊂ D; and
let m be the number of training instances in the region.

The generalised mass base function m(T^h(x)) is defined as

\[
m(T^h(\mathbf{x})) =
\begin{cases}
m & \text{if } \mathbf{x} \text{ is in a region of } T^h \text{ having } m \text{ instances} \\
0 & \text{otherwise}
\end{cases}
\]


In one-dimensional problems, (1), (3) and (5) can now be approximated as follows:

\[
\sum_{i=1}^{n-1} m_i(x)\, p(s_i) \approx \frac{1}{c} \sum_{k=1}^{c} m(T_k^1(x)) \tag{6}
\]

\[
mass(x, h) \approx \frac{1}{c} \sum_{k=1}^{c} m(T_k^h(x)) \tag{7}
\]

\[
mass(x, h) \approx \frac{1}{c} \sum_{k=1}^{c} m(T_k^h(x|\mathcal{D}_k)) \tag{8}
\]

where c > 0 is the number of random regions to be used to define the mass for x.

Here every T_k is generated randomly with equal probability. Note that p(s_i) in (1) has
the same assumption.

Since T^h is defined in multi-dimensional space, the multi-dimensional mass estimation
is the same as (8), simply replacing the scalar x with the vector x:

\[
mass(\mathbf{x}, h) \approx \frac{1}{c} \sum_{k=1}^{c} m(T_k^h(\mathbf{x}|\mathcal{D}_k)) \tag{9}
\]

Like its one-dimensional counterpart, the multi-dimensional mass estimation stipulates
an ordering from core points (having high mass) to fringe points (having low mass) in a data
cloud, regardless of its density distribution. While we do not have a proof of this property
for multi-dimensional mass estimation, empirical results suggest that it holds. This property
is shown in Fig. 5(a) using h = 1, where the highest mass value is at the centre of the entire
data cloud when the four clusters, one scattered in each of the four quadrants, are treated as
a single data cloud. Mass values become lower as they move away from the centre. Note that
though this implementation of multi-dimensional mass estimation does not guarantee
concavity, the approximation of the ordering is sufficiently close to a concave function (in
regions with data) to produce the required ranking for different purposes (see Sect. 6).
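The core-to-fringe ordering can be checked numerically with a minimal sketch of (9) for h = 1. The data mimic the four-cluster layout described for Fig. 5, but the split point is drawn uniformly within the subsample's range on the chosen attribute, which is a simplifying assumption made here and not the Half-Space Tree construction of Sect. 4. All function names are illustrative.

```python
import random


def mass_level1(x, data, c=1000, psi=64, seed=0):
    """Average region mass of x over c random level-1 half-space splits."""
    rng = random.Random(seed)
    total = 0
    for _ in range(c):
        sub = rng.sample(data, psi)        # random subsample, without replacement
        q = rng.randrange(len(x))          # randomly selected attribute
        lo = min(p[q] for p in sub)
        hi = max(p[q] for p in sub)
        s = rng.uniform(lo, hi)            # assumed split point (see lead-in)
        side = x[q] < s                    # which half-space x falls into
        total += sum(1 for p in sub if (p[q] < s) == side)
    return total / c


def four_clusters(rng, n_per_cluster=50):
    """Unit-variance Gaussian clusters centred in the four quadrants."""
    centres = [(2, 2), (-2, 2), (-2, -2), (2, -2)]
    return [(rng.gauss(cx, 1), rng.gauss(cy, 1))
            for cx, cy in centres for _ in range(n_per_cluster)]


data = four_clusters(random.Random(1))
centre_mass = mass_level1((0.0, 0.0), data)   # centre of the whole data cloud
fringe_mass = mass_level1((6.0, 6.0), data)   # far outside every cluster
```

Under these assumptions the centre of the cloud should receive a clearly higher mass than the fringe point, even though the centre lies in a low-density gap between the clusters.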

Figure  5(b)  shows  the  contour  map  for  h  =  32  on  the  same  data  set.  It  demonstrates  that 
multi-dimensional  mass  estimation  can  use  a  high  h  level  to  model  multi-modal  distribution. 

We show in Sect. 6 that both mass(x, h|𝒟) and m(T^h(x|𝒟)) (in (5) and (9), respectively)
can be employed effectively for three different tasks: information retrieval, regression and
anomaly detection, through the mass-based formalism described in Sect. 5. We shall
describe the implementation of multi-dimensional mass estimation in the next section.

Fig. 5 Contour maps of multi-dimensional mass distribution for a two-dimensional data set with four clusters
(each containing 50 points), where points in each cluster are marked with a distinct marker. The points are
randomly drawn from Gaussian distributions with unit standard deviation and means located at (2; 2), (−2; 2),
(−2; −2) and (2; −2), respectively. The two figures are produced using h = 1 and h = 32, respectively. Other
parameters are set as follows: c = 1000 and ψ = |𝒟| = 64. The algorithm used to generate these contour
maps will be described in Sect. 4.2. The legend indicates the colour-coded mass values


4  Half-Space  Trees  for  mass  estimation 

This section describes the implementation of T^h using Half-Space Tree. Two variants are
provided. We have used the second variant of Half-Space Tree to implement the
multi-dimensional mass estimation.

4.1  Half-Space  Tree 

The motivation of the proposed method, Half-Space Tree, comes from the fact that
equal-size regions contain the same mass in a space with uniform mass distribution,
regardless of the shapes of the regions. This is shown in Fig. 6(a), where the space enveloped
by the data is split into equal-size half-spaces recursively three times into eight regions. Note
that the shapes of the regions may be different because the splits at the same level may not
use the same attribute.

The  binary  half-space  split  ensures  that  every  split  produces  two  equal-size  half-spaces, 
each  containing  exactly  half  of  the  mass  before  the  split  under  a  uniform  mass  distribution. 
This  characteristic  enables  us  to  compute  the  relationship  between  any  two  regions  easily. 
For example, the mass in every region shown in Fig. 6(a) is the same, and it is equivalent
to the original mass divided by 2^3 because three levels of binary half-space splits have been
applied. A deviation from the uniform mass distribution allows us to rank the regions based
on  mass.  Figure  6(b)  provides  such  an  example  in  which  a  ranking  of  regions  based  on  mass 
provides  an  order  of  the  degrees  of  anomaly  in  each  region. 

Definition 6 Half-Space Tree is a binary tree in which each internal node makes a
half-space split into two equal-size regions, and each external node terminates further splits.
All nodes record the mass of the training data in their own regions.

Let T^h[ℓ] be a Half-Space Tree with depth level ℓ; and let m(T^h[ℓ]), or m[ℓ] for short,
be the mass in one of the regions at level ℓ.

The relationship between any two regions is expressed using mass with reference to m[0]
at depth level ℓ = 0 (the root) of a Half-Space Tree.

Fig. 6 Half-space subdivisions of: (a) uniform mass distribution; and (b) non-uniform mass
distribution


Under uniform mass distribution, the mass at level i is related to mass at level 0 as
follows:

\[
m[0] = m[i] \times 2^{i},
\]

or mass values between any two regions at levels i and j, ∀ i ≠ j, are related as follows:

\[
m[i] \times 2^{i} = m[j] \times 2^{j}.
\]

Under non-uniform mass distribution, the following inequality establishes an ordering
between any two regions at different levels:

\[
m[i] \times 2^{i} < m[j] \times 2^{j}.
\]

We employ the above property to rank instances and define the (augmented) mass for
Half-Space Tree as follows.

\[
s(\mathbf{x}) = m[\ell] \times 2^{\ell}, \tag{10}
\]

where ℓ is the depth level of an external node with m[ℓ] instances into which a test
instance x falls.

Mass is estimated using m[ℓ] only if a Half-Space Tree has all external nodes at the same
depth level. The estimation is based on augmented mass, m[ℓ] × 2^ℓ, if the external nodes
have differing depth levels. We describe two such variants of Half-Space Tree below.
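A short worked example may help to see why the factor 2^ℓ makes regions at different depths comparable. The sketch below assumes a root mass of 32, which is consistent with the uniform example in which every level-3 region holds 4 points; the non-uniform counts are made-up illustrative numbers and the function name is hypothetical.

```python
def augmented_mass(m, level):
    """Augmented mass as in (10): region mass m scaled by 2**level."""
    return m * 2 ** level


# Uniform case: each of the 2**3 regions at level 3 holds 4 of the 32 points,
# so scaling by 2**3 recovers the mass at the root.
uniform_score = augmented_mass(4, 3)

# Non-uniform case (illustrative counts): a sparse region that stopped
# splitting at level 2 with 1 point, and a dense region at level 4 with 6 points.
sparse_score = augmented_mass(1, 2)
dense_score = augmented_mass(6, 4)
```

The sparse region scores 4 and the dense region scores 96, so the dense region ranks higher even though it is smaller in volume.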

HS-Tree:  based  on  mass  only.  The  first  variant,  HS-Tree,  builds  a  balanced  binary  tree 
structure  which  makes  a  half-space  split  at  each  internal  node  and  all  external  nodes  have 
the  same  depth.  The  number  of  training  instances  falling  into  each  external  node  is  recorded 
and  it  is  used  for  mass  estimation.  An  example  of  HS-Tree  is  shown  in  Fig.  7(a). 

HS*-Tree: based on augmented mass. Unlike HS-Tree, the second variant, HS*-Tree,
has external nodes with differing depth levels. The mass estimated from HS*-Tree is
defined in equation (10) in order to account for different depths. We call this augmented
mass because the mass is augmented in the calculation by the depth level in HS*-Tree, as
opposed to mass only in HS-Tree.

Fig. 7 Half-Space Tree: (a) HS-Tree: An HS-Tree for the data shown in Fig. 6(a) has
m_i = 4, ∀ i, which are m[ℓ = 3] (i.e., mass at level 3). (b) HS*-Tree: An example of a
special case of HS*-Tree when the size limit is set to 1

In a special case of HS*-Tree, the tree growing process at a branch will only terminate
to form an external node if the training data size at the branch is 1 (i.e., the size limit is set
to 1). Here the mass estimated depends on depth level only, i.e., 2^ℓ or simply ℓ. In other
words, the depth level becomes a proxy for mass in HS*-Tree when the size limit is set to 1.
An example of HS*-Tree, when the size limit is set to 1, is shown in Fig. 7(b).

Since  the  two  variants  have  similar  performance,  we  focus  on  HS*-Tree  only  in  this 
paper  because  it  builds  a  smaller-sized  tree  than  HS-Tree  which  may  grow  many  branches 
with  zero  mass — this  saves  on  training  time  and  memory  space  requirements. 

4.2  Algorithm  to  generate  Half-Space  Trees 

Half-Space Trees estimate a mass distribution efficiently, without density or distance
calculations or clustering. We first describe the training procedure, then the testing procedure,
and finally the time and space complexities.

Training. The procedure to generate a Half-Space Tree is shown in Algorithm 1. It starts
by defining a (random) range for each dimension in order to form a work space which
covers all the training data. The InitialiseWorkSpace(·) function in Algorithm 1 is carried
out as follows. For each attribute q, a random split value (z_q) is chosen within the range
[𝒟min_q, 𝒟max_q], i.e., the minimum and maximum values of q in the subsample. Then,
attribute q of the work space is defined to have the range [min_q, max_q] = [z_q − r, z_q + r],
where r = 2 · max(z_q − 𝒟min_q, 𝒟max_q − z_q). The ranges of all dimensions define the work
space used to generate a Half-Space Tree. The work space defined by [min_q, max_q] is then
passed over to Algorithm 2 to construct a Half-Space Tree.

Algorithm 1 T^h(𝒟, S, h)

Inputs: 𝒟: input data; S: data size limit at external node; h: maximum depth limit
Output: a Half-Space Tree
1: SizeLimit ← S
2: MaxDepthLimit ← h
3: (min, max) ← InitialiseWorkSpace(𝒟)
4: return SingleHalf-SpaceTree(𝒟, min, max, 0)
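The work space initialisation described above can be sketched as follows. This is a minimal illustration under the stated definition of z_q and r; the function name, the tuple representation of points and the example subsample are assumptions made for this sketch.

```python
import random


def initialise_work_space(sub, rng):
    """Return (mins, maxs): one randomly placed range per attribute."""
    mins, maxs = [], []
    for q in range(len(sub[0])):
        lo = min(p[q] for p in sub)        # minimum of attribute q in the subsample
        hi = max(p[q] for p in sub)        # maximum of attribute q in the subsample
        z = rng.uniform(lo, hi)            # random split value z_q
        r = 2 * max(z - lo, hi - z)        # half-width of the work space on q
        mins.append(z - r)
        maxs.append(z + r)
    return mins, maxs


# Hypothetical two-dimensional subsample of three points.
sub = [(0.0, 1.0), (2.0, -1.0), (1.0, 3.0)]
mins, maxs = initialise_work_space(sub, random.Random(0))
```

Because r is at least twice the distance from z_q to the farther end of the data range, the resulting interval always contains every subsample value on that attribute, while its position varies from tree to tree.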

Constructing a single Half-Space Tree is almost identical to constructing an ordinary
decision tree¹ (Quinlan 1993), except that no splitting selection criterion is required at each
node.

Given a work space, an attribute q is randomly selected to form an internal node of
a Half-Space Tree (line 4 in Algorithm 2). The split point of this internal node is simply
the mid-point between the minimum and maximum values of attribute q (i.e., min_q and
max_q), defined by the work space (line 5). Data are filtered through one of the two branches
depending on which side of the split the data reside (lines 6-7). This node building process
is repeated for each branch (lines 9-12 in Algorithm 2) until a size limit or a depth limit
is reached to form an external node (lines 1-2 in Algorithm 2). The training instances at
the external node at depth level ℓ form the mass m(T^h(x|𝒟)) to be used during the testing
process for x. The parameters are set as follows: S = log_2(|𝒟|) − 1 and h = |𝒟| for all the
experiments conducted in this paper.

Ensemble. The proposed method uses a random subsample 𝒟 to build one Half-Space
Tree (i.e., T^h(·|𝒟)), and multiple Half-Space Trees are constructed from different random
subsamples (using sampling without replacement) to form an ensemble.

Testing.  During  testing,  a  test  instance  x  traverses  through  each  Half-Space  Tree  from  the 
root  to  an  external  node,  and  the  mass  recorded  at  the  external  node  is  used  to  compute  its 
augmented  mass  (see  (11)  below).  This  testing  is  carried  out  for  all  Half-Space  Trees  in  the 
ensemble,  and  the  final  score  is  the  average  score  from  all  trees,  as  expressed  in  (12)  below. 

The mass, augmented by depth ℓ of the region of Half-Space Tree T^h into which x
falls, is given as follows.

\[
s(\mathbf{x}, h) = m(T^h(\mathbf{x}|\mathcal{D})) \times 2^{\ell} \tag{11}
\]

Mass needs to be augmented with depth ℓ of a Half-Space Tree in order to ‘normalise’
the masses from different depths in the tree.

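The mass for point x estimated from an ensemble of c Half-Space Trees is given as
follows.

\[
mass(\mathbf{x}, h) \approx \frac{1}{c} \sum_{k=1}^{c} m(T_k^h(\mathbf{x}|\mathcal{D}_k)) \times 2^{\ell_k} \tag{12}
\]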
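The testing procedure, including the averaging over the ensemble, can be sketched as follows. The nested-tuple tree encoding and the two hand-built trees are hypothetical stand-ins chosen to keep the example self-contained; they are not trees produced by the training procedure.

```python
def tree_score(tree, x, level=0):
    """Traverse to an external node and return its mass times 2**depth."""
    if tree[0] == "leaf":
        _, size = tree
        return size * 2 ** level
    _, att, value, left, right = tree
    child = left if x[att] < value else right
    return tree_score(child, x, level + 1)


def ensemble_mass(trees, x):
    """Average the augmented mass of x over all trees in the ensemble."""
    return sum(tree_score(t, x) for t in trees) / len(trees)


# Leaves are ("leaf", size); internal nodes are
# ("node", split_attribute, split_value, left_subtree, right_subtree).
tree_a = ("node", 0, 0.5,
          ("leaf", 6),
          ("node", 1, 0.5, ("leaf", 2), ("leaf", 0)))
tree_b = ("node", 1, 0.5, ("leaf", 5), ("leaf", 3))
trees = [tree_a, tree_b]

dense_point = ensemble_mass(trees, (0.2, 0.3))    # (6*2 + 5*2) / 2
sparse_point = ensemble_mass(trees, (0.9, 0.9))   # (0*4 + 3*2) / 2
```

A point falling into well-populated regions ends up with a higher average score than one falling into empty or sparsely populated regions, which is the ordering the ensemble is meant to provide.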


¹ However, they are for different tasks: decision trees are for supervised learning tasks; Half-Space Trees are
for unsupervised learning tasks.

Algorithm 2 SingleHalf-SpaceTree(𝒟, min, max, ℓ)

Inputs: 𝒟: input data; min and max: arrays of minimum and maximum values for all
attributes in a work space; ℓ: current depth level
Output: a Half-Space Tree
1: if (|𝒟| ≤ SizeLimit) or (ℓ ≥ MaxDepthLimit) then
2:   return exNode(Size ← |𝒟|)
3: else
4:   randomly select an attribute q
5:   mid_q ← (max_q + min_q)/2 {mid_q is the mid-point value between the current min_q and
     max_q values of q.}
6:   𝒟_l ← filter(𝒟, q < mid_q) {Extract data satisfying condition: q < mid_q.}
7:   𝒟_r ← filter(𝒟, q ≥ mid_q) {Extract data satisfying condition: q ≥ mid_q.}
8:   {Build two nodes: Left and Right as a result of a split into two equal-volume
     half-spaces.}
9:   temp ← max_q; max_q ← mid_q
10:  Left ← SingleHalf-SpaceTree(𝒟_l, min, max, ℓ + 1)
11:  max_q ← temp; min_q ← mid_q
12:  Right ← SingleHalf-SpaceTree(𝒟_r, min, max, ℓ + 1)
13:  return inNode(Left, Right, SplitAtt ← q, SplitValue ← mid_q)
14: end if
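The recursion in Algorithm 2 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the implementation used in the experiments: the `Node` class, the helper `leaves`, and the fixed unit-square work space (used here in place of the randomised work space of Algorithm 1) are assumptions of the sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    size: int                            # mass: training points in this region
    split_att: Optional[int] = None      # None marks an external node
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(data: List[Point], mins: List[float], maxs: List[float],
               level: int, size_limit: int, max_depth: int,
               rng: random.Random) -> Node:
    # External node: size limit or depth limit reached.
    if len(data) <= size_limit or level >= max_depth:
        return Node(size=len(data))
    q = rng.randrange(len(mins))                  # randomly selected attribute
    mid = (mins[q] + maxs[q]) / 2                 # mid-point of the work space on q
    left_data = [p for p in data if p[q] < mid]
    right_data = [p for p in data if p[q] >= mid]
    left_maxs = maxs[:q] + [mid] + maxs[q + 1:]   # left half-space ends at mid
    right_mins = mins[:q] + [mid] + mins[q + 1:]  # right half-space starts at mid
    return Node(len(data), q, mid,
                build_tree(left_data, mins, left_maxs, level + 1,
                           size_limit, max_depth, rng),
                build_tree(right_data, right_mins, maxs, level + 1,
                           size_limit, max_depth, rng))


def leaves(node: Node, level: int = 0):
    """Yield (depth, size) for every external node."""
    if node.split_att is None:
        yield level, node.size
    else:
        yield from leaves(node.left, level + 1)
        yield from leaves(node.right, level + 1)


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(8)]
tree = build_tree(points, [0.0, 0.0], [1.0, 1.0], 0,
                  size_limit=1, max_depth=8, rng=rng)
leaf_info = list(leaves(tree))
```

Copying the range lists for each branch replaces the temp-variable bookkeeping of lines 9 and 11; the effect is the same, since each child sees a work space halved on attribute q.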


Time and Space complexities. Because it involves no evaluations or searches, a
Half-Space Tree can be generated quickly. In addition, a good performing Half-Space Tree can
be generated using only a small subsample (size ψ) from a given data set of size n, where
ψ ≪ n. An ensemble of Half-Space Trees has training time complexity O(chψ), which is
constant for an ensemble with fixed subsample size ψ, maximum depth level h and ensemble
size c. It has time complexity O(chn) during testing. The space complexity for Half-Space
Trees is O(chψ) and is also a constant for an ensemble with fixed subsample size, maximum
depth level and ensemble size.


5  Mass-based  formalism 

The data ordering expressed as a mass distribution can be interpreted as a measure of
relevance with respect to the concept underlying the data, i.e., points having high mass are
highly relevant to the concept and points having low mass are less relevant. In tasks whose
primary  aim  is  to  rank  points  in  a  database  with  reference  to  a  data  profile,  mass  provides  the 
ideal  ranking  measure  without  distance  or  density  calculations.  In  anomaly  detection,  high 
mass  signifies  normal  points  and  low  mass  signifies  anomalies;  in  information  retrieval,  high 
(low)  mass  signifies  that  a  database  point  is  highly  (less)  relevant  to  the  query.  Even  in  tasks 
whose  primary  aim  is  not  ranking,  the  transformed  mass  space  can  be  better  exploited  by 
existing  algorithms  because  the  transformation  stretches  concept-irrelevant  points  farther 
away  from  relevant  points  in  the  mass  space. 

We  introduce  a  formalism  in  which  mass  can  be  applied  to  different  tasks  in  this  section, 
and  provide  the  empirical  evaluation  in  the  following  section. 

Let x_i = [x_i^1, ..., x_i^u], x_i ∈ D, of u dimensions; and z_i = [z_i^1, ..., z_i^t], z_i ∈ D′, in the
transformed mass space of t dimensions. The proposed formalism consists of three components:

C1 The first component constructs a number of mass distributions in a mass space. A mass
distribution mass(x^d, h|𝒟) for dimension d in the original feature space is obtained
using our proposed one-dimensional mass estimation, as given in Definition 5. A
total number of t mass distributions is generated which forms mass(x) → ℛ^t, where
t ≫ u. This procedure is given in Algorithm 3. Multi-dimensional mass estimation
m(T^h(x|𝒟)) (replacing one-dimensional mass estimation mass(x^d, h|𝒟)) can be used
to generate the mass space similarly; see note in Algorithm 3.

C2 The second component maps the data set D in the original space of u dimensions into
a new data set D′ in t-dimensional mass space using mass(x) = z. This procedure is
described in Algorithm 4.

C3 The third component employs a decision rule to determine the final outcome for the task
at hand. It is a task-specific decision function applied to z in the new mass space.

Algorithm 3 Mass_Estimation(D, ψ, h, t)

Inputs: D: input data; ψ: data size for 𝒟_k; h: level of mass distribution; t: number of
mass distributions.
Output: mass(x) → ℛ^t: a function consisting of t mass distributions, using either
one-dimensional or multi-dimensional mass estimation: mass(x^d, h|𝒟_k) or m(T^h(x|𝒟_k)).
1: for k = 1 to t do
2:   𝒟_k ← a random subset of size ψ from D;
3:   d ← a randomly selected dimension from {1, ..., u};
4:   Build mass(x^d, h|𝒟_k);
5: end for
Note: For multi-dimensional mass estimation, steps 3 and 4 are replaced with one step:
Build m(T^h(x|𝒟_k));

Algorithm 4 Mass_Mapping(D, mass)

Inputs: D: input data; mass: a function consisting of t mass distributions.
Output: D′: a set of mapped instances z_i in t dimensions.
1: for i = 1 to |D| do
2:   z_i ← mass(x_i);
3: end for

The formalism becomes a blueprint for different tasks. Components C1 and C3 are
mandatory in the formalism, but component C2 is optional, depending on the task.

For information retrieval and regression, the task-specific C3 procedure simply uses
an existing algorithm for the task, except that the process is carried out in the new mapped
mass space instead of the original space. The MassSpace procedure is given in
Algorithm 5.

The task-specific C3 procedure for anomaly detection is shown in steps 2-5 in
Algorithm 6: MassAD. Note that anomaly detection requires C1 and C3 only, whereas the
other two tasks require all three components.

In our experiments described in the next section, the mapping from u dimensions to
t dimensions using Algorithm 3 is carried out one dimension at a time when using
one-dimensional mass estimation; and all u dimensions at a time when using multi-dimensional
mass estimation. Each such mapping produces one dimension in mass space and is repeated
t times to get a t-dimensional mass space. Note that randomisation gives different variations
to each of the t mappings. The first randomisation occurs at step 2 in Algorithm 3 in selecting
a random subset of data. Additional randomisation is applied to attribute selection at step 3
in Algorithm 3 for one-dimensional mass estimation, or at step 4 in Algorithm 2 for
multi-dimensional mass estimation.

Algorithm 5 Perform task in MassSpace(D, ψ, h, t)

Inputs: D: input data; ψ: data size for 𝒟_k; h: level of mass distribution; t: number of
mass distributions.
Output: Task-specific model in mass space.
1: mass(·) ← Mass_Estimation(D, ψ, h, t);
2: D′ ← Mass_Mapping(D, mass);
3: Perform task (information retrieval or regression) in the mapped mass space using D′;

Algorithm 6 for Anomaly Detection: MassAD(D, ψ, h, t)

Inputs: D: input data; ψ: data size for 𝒟_k; h: level of mass distribution; t: number of
mass distributions.
Output: Ranked instances in D.
1: mass(·) ← Mass_Estimation(D, ψ, h, t);
2: for i = 1 to |D| do
3:   M_i ← Average of t masses from mass(x_i);
4: end for
5: Rank instances in D based on M_i, where low mass denotes anomalies and high mass
denotes normal points;
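The mapping of Algorithm 4 and the ranking steps of Algorithm 6 can be sketched as follows. The mass distributions are passed in as plain callables; the `split_mass` helper, which counts sample points on the same side of a fixed split, is a hypothetical stand-in for the mass estimators built by Algorithm 3, and the data values are toy numbers.

```python
def mass_mapping(data, mass_functions):
    """Algorithm 4: map every instance to a vector of t mass values."""
    return [[f(x) for f in mass_functions] for x in data]


def mass_ad(data, mass_functions):
    """Steps 2-5 of Algorithm 6: rank instances by average mass, lowest first."""
    t = len(mass_functions)
    mean_mass = [sum(f(x) for f in mass_functions) / t for x in data]
    ranking = sorted(range(len(data)), key=lambda i: mean_mass[i])
    return ranking, mean_mass


def split_mass(sample, split):
    """Stand-in mass distribution: sample points on x's side of the split."""
    def mass(x):
        side = x < split
        return sum(1 for p in sample if (p < split) == side)
    return mass


# Toy one-dimensional data: five clustered values and one isolated value.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 5.0]
mass_functions = [split_mass(data, s) for s in (1.0, 2.5, 4.0)]

mapped = mass_mapping(data, mass_functions)       # |D| vectors of t = 3 masses
ranking, mean_mass = mass_ad(data, mass_functions)
```

The isolated value receives the lowest average mass and therefore appears first in the ranking, which is how low mass is read as a sign of anomaly.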


6  Experiments 

We  evaluate  the  performance  of  MassSpace  and  MassAD  for  three  tasks  in  the  following 
three  subsections.  We  denote  an  algorithm  A  using  one-dimensional  and  multi-dimensional 
mass  estimations  as  A'  and  A",  respectively. 

In information retrieval and regression tasks, the mass estimation uses ψ = 8 and t =
1000. These settings are obtained by examining the rank correlation as shown in Fig. 4(b),
i.e., by having a high rank correlation between mass(x, 1) and mass(x, 1|𝒟). Note that this
is done before any method is applied, and no further tuning of the parameters is carried out
after this step. In anomaly detection tasks, ψ = 256 and t = 100 are used so that they are
comparable to those used in a benchmark method for a fair comparison. In all tasks, h = 1
is used for one-dimensional mass estimation, which cannot afford to use a high h because of
its high cost O(ψ^h). For multi-dimensional mass estimation, h = ψ is used in order to save
one parameter setting.

All the experiments were run in Matlab and conducted on a Xeon processor which ran at
2.66 GHz and with 48 GB memory. The performance of each method was measured in terms
of task-specific performance measure and runtime. Paired t-tests at 5% significance level
were conducted to examine whether the difference in performance is significant between
two algorithms under comparison.

Note that we treated information retrieval and anomaly detection as unsupervised
learning tasks. Classes/labels in the original data were used as ground truth for evaluation of
performance only; they were not used in building mass distributions. In regression, only the
training set was used to build mass distributions in step 1 of Algorithm 5; the mapping in
step 2 was conducted for both the training and testing sets.

6.1  Content-based  image  retrieval 

We  use  a  Content-Based  Image  Retrieval  (CBIR)  task  as  an  example  of  information  retrieval. 
The  MassSpace  approach  is  compared  with  three  state-of-the-art  CBIR  methods  that  deal 
with relevance feedback: a manifold-based method MRBIR (He et al. 2004), and two recent
techniques for improving similarity calculation, i.e., Qsim (Zhou and Dai 2006) and InstR
(Giacinto and Roli 2005); we employ the Euclidean distance to measure the similarity
between instances in these two methods. The default parameter settings are used for all these
methods.

Our  experiments  were  conducted  using  the  COREL  image  database  (Zhou  et  al.  2006)  of 
10000  images,  which  contains  100  categories  and  each  category  has  100  images.  Each  image 
is  represented  by  a  67-dimensional  feature  vector,  which  consists  of  11  shape,  24  texture 
and  32  color  features.  To  test  the  performance,  we  randomly  selected  5  images  from  each 
category  to  serve  as  the  queries.  For  a  query,  the  images  within  the  same  category  were 
regarded  as  relevant  and  the  rest  were  irrelevant.  For  each  query,  we  continued  to  perform 
up  to  5  rounds  of  relevance  feedback.  In  each  round,  2  positive  and  2  negative  feedbacks 
were  provided.  This  relevance  feedback  process  was  also  repeated  5  times  with  5  different 
series  of  feedbacks.  Finally,  the  average  results  with  one  query  and  in  different  feedback 
rounds were recorded. The retrieval performance was measured in terms of
Break-Even-Point (BEP) (Zhou and Dai 2006; Zhou et al. 2006) of the precision-recall curve. The online
processing time reported is the time required in each method for a query plus the stated
number of feedback rounds. The reported result is an average over 5 × 100 runs for query
only; and an average over 5 × 100 × 5 runs for query plus feedbacks. The offline costs
of  constructing  the  one-dimensional  mass  estimation  and  the  mapping  of  10000  images 
were  0.27  and  0.32  seconds,  respectively.  The  multi-dimensional  mass  estimation  and  the 
corresponding  mapping  took  1.72  and  5.74  seconds,  respectively. 

The  results  are  presented  in  Table  2  where  the  retrieval  performance  better  than  that 
conducted  in  the  original  space  at  each  round  has  been  boldfaced.  The  results  are  grouped 
for  ease  of  comparison. 

The BEP results clearly show that the MassSpace approach achieves a better retrieval
performance than that using the original space in all three methods MRBIR, Qsim and
InstR, for one query and all rounds of relevance feedbacks. Paired t-tests with 5%
significance level also indicate that the MassSpace approach significantly outperforms each of
the three methods in all experiments, without exception. These results show that the mass
space provides useful additional information that is hidden in the original space.

The results also show that the multi-dimensional mass estimation provides better
information than the one-dimensional mass estimation: MRBIR", Qsim" and InstR" give
better retrieval performance than MRBIR', Qsim' and InstR', respectively. The only
exceptions occur in the higher feedback rounds for InstR', with minor differences.

The processing time for the MassSpace approach is expected to be longer than each
of the three methods because the number of dimensions in the mass space is significantly
higher than that in the original space, where t = 1000 and u = 67. Despite that, Table 3
shows that MRBIR", MRBIR' and MRBIR all have a similar level of runtime.

Figure 8 shows an example of performance for InstR': BEP increases as t increases
until it reaches a plateau at some t value; and the processing time is linear w.r.t. the number
of dimensions of the mass space, t.

Table 2 CBIR results (in BEP × 10^-2). An algorithm A using one-dimensional and multi-dimensional mass
estimations is denoted as A' and A", respectively. Note that a high BEP is better than a low BEP

            MRBIR"  MRBIR'  MRBIR   Qsim"   Qsim'   Qsim    InstR"  InstR'  InstR
One query   12.65   10.70    9.69   12.38   10.35    7.78   12.38   10.35    7.78
Round 1     16.58   14.24   12.72   19.18   15.46   10.59   13.88   13.33    9.40
Round 2     18.41   16.05   13.90   21.98   17.58   11.81   15.12   14.95    9.99
Round 3     19.69   17.34   14.75   23.67   18.71   12.59   16.19   16.07   10.36
Round 4     20.48   18.20   15.33   24.65   19.50   13.16   16.88   16.93   10.78
Round 5     21.15   19.86   15.71   25.42   19.96   13.55   17.49   17.58   11.05

Table 3 CBIR results (online time cost in seconds)

            MRBIR"  MRBIR'  MRBIR   Qsim"   Qsim'   Qsim    InstR"  InstR'  InstR
One query   0.714   0.785   0.364   0.715   0.822   0.093   0.715   0.822   0.093
Round 1     0.762   0.893   0.696   0.207   0.208   0.035   0.197   0.198   0.026
Round 2     0.763   0.893   0.696   0.228   0.231   0.058   0.200   0.200   0.028
Round 3     0.763   0.893   0.696   0.257   0.259   0.086   0.200   0.200   0.028
Round 4     0.764   0.893   0.696   0.291   0.294   0.122   0.200   0.200   0.028
Round 5     0.764   0.893   0.697   0.335   0.341   0.167   0.200   0.200   0.028

6.2  Regression 

In this experiment, we compare support vector regression (Vapnik 2000) that employs the
original space (SVR) with support vector regression that employs the mapped mass space
(SVR" and SVR'). SVR is the ε-SVR algorithm with RBF kernel, implemented by LIBSVM
(Chang and Lin 2001). SVR is chosen here because it is one of the top performing models.

We utilize five benchmark data sets, including four selected from the UCI repository
(Asuncion and Newman 2007) and one earthquake data set (Simonoff 1996) from
www.cs.waikato.ac.nz/ml/weka/distribution. The data sizes are shown in the second column
of Table 4. We only considered data sets with more than 1000 data points that contained
only real-valued attributes and that had no missing values; we did this in order to get a
result with a higher confidence than those obtained from small data sets.

Table 4 Regression results (the smaller the better for MSE)

             Data size   MSE (×10^-2)             W/D/L
                         SVR"    SVR'    SVR      SVR"     SVR'
tic          9822        5.56    5.58    5.62     18/0/2   17/0/3
wine_white   4898        1.08    1.21    1.36     20/0/0   20/0/0
quake        2178        2.87    2.86    2.92     17/0/3   18/0/2
wine_red     1599        1.50    1.62    1.62     19/0/1   11/0/9
concrete     1030        0.28    0.33    0.57     20/0/0   20/0/0

Table 5 Regression results (time in seconds)

             #Dimension   Processing time          Factor increase
                          SVR"    SVR'    SVR      time(SVR")   time(SVR')   #dimension
tic          85           23.4    26.6    11.9     2.0          2.2          12
wine_white   11            8.2     9.2     4.2     2.0          2.2          91
quake        3             2.5     3.4     1.0     2.5          3.4          333
wine_red     11            1.7     2.6     1.0     1.6          2.5          91
concrete     8             1.2     2.3     0.9     1.3          2.6          125

In each data set, we randomly sampled two-thirds of the instances for training and
the remaining one-third for testing. This was repeated 20 times and we report the average
result of these 20 runs. The data set, whether in the original space or the mass space,
was min-max normalized before an ε-SVR model was trained. To select optimal parameters
for the ε-SVR algorithm, we conducted a 5-fold cross validation based on mean
squared error using the training set only. The kernel parameter γ was searched in the range
{2^-15, 2^-13, 2^-11, ..., 2^3, 2^5}; the regularization parameter C in the range {0.1, 1, 10};
and ε in the range {0.01, 0.05, 0.1}. We measured regression performance in terms of mean
squared error (MSE) and runtime in seconds. The runtime reported is the runtime for SVR
only. The total cost of mass estimation (from the training set) and mapping (of training and
testing sets) in the largest data set, tic, was 1.8 seconds for one-dimensional mass estimation,
and 8.5 seconds for multi-dimensional mass estimation. The cost of normalisation and the
parameter search using 5-fold cross-validation was not included in the reported results for
SVR", SVR' and SVR.
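
The parameter search described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes that the elided part of the γ range continues in steps of two in the exponent, and the scoring function cv_mse is a hypothetical placeholder for the 5-fold cross-validated MSE of one parameter setting.

```python
from itertools import product

# Kernel parameter candidates 2^-15, 2^-13, ..., 2^3, 2^5.
# Assumption: the elided middle of the range steps the exponent by 2.
gammas = [2.0 ** e for e in range(-15, 6, 2)]
costs = [0.1, 1, 10]            # regularization parameter C
epsilons = [0.01, 0.05, 0.1]    # epsilon of the epsilon-SVR loss

# Every combination of (gamma, C, epsilon) that would be evaluated.
grid = [
    {"gamma": g, "C": c, "epsilon": e}
    for g, c, e in product(gammas, costs, epsilons)
]


def select_best(grid, cv_mse):
    """Return the setting with the smallest cross-validated MSE.

    cv_mse is a hypothetical callable mapping one setting to its
    5-fold cross-validation MSE on the training set.
    """
    return min(grid, key=cv_mse)
```

Under these assumptions the grid contains 11 × 3 × 3 = 99 settings, each of which requires five model fits per training set.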

The results are presented in Table 4. In terms of MSE, SVR' performs significantly better
than SVR in all data sets except wine_red. SVR" performs significantly better than SVR in
all data sets, without exception. SVR" generally performs better than SVR'.

Both SVR" and SVR' take more time to run because each of them runs on
data with a significantly higher dimension. However, the factor of increase in time (shown in
the last three columns of Table 5) ranges from 1.3 to 3.4 only, while the factor of increase in
the number of dimensions ranges from 12 to over 300. This is because the time complexity of
the key optimisation process in SVR is not dependent on the number of dimensions.

Table 6 Data characteristics of the data sets in anomaly detection tasks. The percentage in
brackets indicates the percentage of anomalies

Data set       Data size   #Dimension   Anomaly class
Http           567497      3            Attack (0.4 %)
Forest         286048      10           Class 4 (0.9 %) vs class 2
Mulcross       262144      4            2 clusters (10 %)
Smtp           95156       3            Attack (0.03 %)
Shuttle        49097       9            Classes 2, 3, 5, 6, 7 (7 %) vs class 1
Mammography    11183       6            Class 1 (2 %)
Annthyroid     7200        6            Classes 1, 2 (7 %)
Satellite      6435        36           3 smallest classes (32 %)

6.3  Anomaly  detection 


This experiment compares MassAD with four state-of-the-art anomaly detectors: isolation
forest or iForest (Liu et al. 2008), a distance-based method ORCA (Bay and Schwabacher
2003), a density-based method LOF (Breunig et al. 2000), and one-class support vector
machine (or 1-SVM) (Schölkopf et al. 2000). MassAD was built with t = 100 and ψ = 256,
the same default settings as used in iForest (Liu et al. 2008), which also employed a
multi-model approach. The parameter settings employed for ORCA, LOF and 1-SVM were
as stated by Liu et al. (2008).

All the methods were tested on the eight largest data sets used by Liu et al. (2008). The
data characteristics are summarized in Table 6. One data set is produced by the anomaly data
generator Mulcross (Rocke and Woodruff 1996) and the other seven are from the UCI repository
(Asuncion and Newman 2007). The performance was evaluated in terms of averaged AUC (area
under ROC curve) and processing time (a total of training time and testing time) over ten
runs (following Liu et al. 2008).
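
As an illustration of the evaluation measure, AUC can be computed as the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score, counting ties as one half. The sketch below is not the evaluation code used in the experiments; it assumes that higher scores mean "more anomalous" and that labels are 1 for anomalies and 0 for normal points, and the function names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores, higher = more anomalous (assumed convention).
    labels: 1 for an anomaly, 0 for a normal point.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal point")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # anomaly ranked above normal point
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def averaged_auc(runs):
    """Average AUC over several runs; each run is a (scores, labels) pair."""
    return sum(auc(s, y) for s, y in runs) / len(runs)
```

The pairwise loop is quadratic in the worst case; for data sets of the sizes in Table 6 a rank-based formulation would be preferable, but the quantity computed is the same.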

MassAD and iForest were implemented in Matlab and tested on a Xeon processor running
at 2.66 GHz. LOF was written in Java in the ELKI platform version 0.4 (Achtert et al. 2008);
and ORCA was written in C++ (www.stephenbay.net/orca/). The results for ORCA, LOF
and 1-SVM were obtained using the same experimental setting but on a slightly slower
2.3 GHz machine, the same machine used by Liu et al. (2008).

The AUC values of all the compared methods are presented in Table 7, where the boldfaced
figures are the best performance for each data set. The results show that MassAD using
the multi-dimensional mass estimation achieves the best performance in four data sets, and is
close to the best (a difference of less than 0.03 AUC) in two data sets; MassAD
using the one-dimensional mass estimation achieves the best performance in three data sets,
and is close to the best in one data set. iForest performs best in four data sets. The results
are close for these three algorithms because they share many similarities (see Sect. 9 for
details).

Again, the multi-dimensional version of MassAD generally performs better than the
one-dimensional version, with five wins, one draw and two losses. Most importantly, the worst
performance in the Mulcross data set can be easily ‘corrected’ using a better parameter
setting: by using ψ = 8, instead of 256, the multi-dimensional version of MassAD improves
its detection performance from 0.26 to 1.00 in terms of AUC (see footnote 4).

Table 7 AUC values for anomaly detection

               MassAD
Data set       Mass"   Mass'   iForest   ORCA   LOF    1-SVM
Http           1.00    1.00    1.00      0.36   0.44   0.90
Forest         0.90    0.92    0.87      0.83   0.56   0.90
Mulcross       0.26    0.99    0.96      0.33   0.59   0.59
Smtp           0.91    0.86    0.88      0.87   0.32   0.78
Shuttle        1.00    0.99    1.00      0.55   0.55   0.79
Mammography    0.86    0.37    0.87      0.77   0.71   0.65
Annthyroid     0.75    0.71    0.82      0.68   0.72   0.63
Satellite      0.77    0.62    0.71      0.65   0.52   0.61

Table 8 Runtime (second) for anomaly detection

               MassAD
Data set       Mass"   Mass'   iForest   ORCA   LOF     1-SVM
Http           168     18      74        9487   18913   35872
Forest         63      10      39        6995   10853   9738
Mulcross       52      10      38        2512   5432    7343
Smtp           27      4       13        267    540     987
Shuttle        20      3       8         157    368     333
Mammography    21      1       3         4      39      11
Annthyroid     7       1       3         2      9       4
Satellite      13      1       3         9      10      9

It is also noteworthy that the multi-dimensional MassAD significantly outperforms the
traditional density-based, distance-based and SVM anomaly detectors in all data sets except
two: Annthyroid, when compared to ORCA, and Mulcross, whose poor performance was
discussed earlier. The above observations validate the effectiveness of our proposed mass
estimation on anomaly detection tasks.

Table 8 shows the runtime results. Although MassAD was run on a slightly faster machine,
the results still show that it has a significant advantage in terms of processing time over ORCA,
LOF and 1-SVM. The comparison with iForest is presented in Table 9 with a breakdown
of training time and testing time. Note that one-dimensional MassAD took the same time
as iForest in training, but it only took about one-tenth of the time required by iForest
in testing. On the other hand, the multi-dimensional MassAD took slightly more time than
iForest in training, and it took up to three times the time required by iForest in testing.

The time and space complexities of five anomaly detection methods are given in Table 10.
The one-dimensional MassAD and iForest have the best time and space complexities
due to their ability to use small ψ ≪ n and h = 1. Note that the one-dimensional
MassAD (h = 1) is faster by a factor of log(ψ) = 8 for ψ = 256 (base-2 logarithm), which
shows up in the testing time: it is about ten times faster than iForest, as shown in Table 9.
The training time disadvantage, compared to iForest, did not show up because of the small ψ.
The one-dimensional MassAD also has an advantage over iForest in space complexity by a
factor of log(ψ). The multi-dimensional MassAD has a similar order of worst-case time and
space complexities as iForest, though it might have a larger constant.

Table 9 Training time and testing time (second) for MassAD and iForest, using t = 100 and ψ = 256

               Training time                Testing time
               MassAD                       MassAD
Data set       Mass"   Mass'   iForest      Mass"   Mass'   iForest
Http           16.2    14.3    14.4         151.8   3.3     59.6
Forest         10.3    8.2     8.6          53.1    2.0     30.8
Mulcross       9.1     7.9     8.1          42.8    2.1     29.4
Smtp           5.4     3.9     3.5          21.9    0.6     9.9
Shuttle        6.1     3.1     2.8          14.1    0.3     5.6
Mammography    8.4     1.3     1.2          12.8    0.1     1.8
Annthyroid     3.1     1.3     1.1          3.4     0.1     1.5
Satellite      6.6     1.2     1.6          5.9     0.0     1.9

Table 10 A comparison of time and space complexities. The time complexity includes both
training and testing. n is the given data set size and u is the number of dimensions. For
MassAD and iForest, the first part of the summation is the training time and the second
the testing time

Method                         Time complexity          Space complexity
MassAD (multi-dimensional)     O(t(ψ + n)h)             O(tψh)
MassAD (one-dimensional)       O(t(ψ^(h+1) + n))        O(tψ)
iForest                        O(t(ψ + n) · log(ψ))     O(tψ · log(ψ))
ORCA                           O(un · log(n))           O(un)
LOF                            O(un^2)                  O(un)

4 Mulcross produces anomaly clusters rather than scattered anomalies. Detecting anomaly clusters is more
effective using a low ψ setting when the multi-dimensional version of MassAD is employed.

In contrast to ORCA and LOF (distance-based and density-based methods), the time
complexity (and the space complexity) of both MassAD and iForest is independent of the
number of dimensions u.
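
To make the comparison concrete, the leading terms of the complexity expressions in Table 10 can be evaluated for the settings used here. The sketch below is only an order-of-magnitude illustration: it ignores all constant factors, assumes a base-2 logarithm, and the function names are illustrative rather than part of any implementation.

```python
import math


def cost_massad_1d(n, t, psi, h=1):
    """Leading term of O(t(psi^(h+1) + n)) for one-dimensional MassAD."""
    return t * (psi ** (h + 1) + n)


def cost_massad_md(n, t, psi, h):
    """Leading term of O(t(psi + n)h) for multi-dimensional MassAD."""
    return t * (psi + n) * h


def cost_iforest(n, t, psi):
    """Leading term of O(t(psi + n) log(psi)) for iForest (base-2 log assumed)."""
    return t * (psi + n) * math.log2(psi)


# Http data set size with the default settings t = 100 and psi = 256.
n, t, psi = 567497, 100, 256
ratio = cost_iforest(n, t, psi) / cost_massad_1d(n, t, psi, h=1)
```

With these values the ratio is a little above 7, and it approaches log(ψ) = 8 as n grows relative to ψ^2. None of the three expressions involves the number of dimensions u.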

6.4  Constant  time  and  space  complexities 

In this section, we show that mass(x, h|D) (in step 4 of Algorithm 3) takes only constant
time, regardless of the given data size n, when the algorithmic parameters are fixed. Table 11
reports the runtime for sampling (to get a random sample of size ψ from the given data
set, steps 2 and 3 of Algorithm 3) and the runtime for one-dimensional mass estimation
(to construct mass(x, h|D) t times), for five data sets which include the largest and smallest
data sets in the regression and anomaly detection tasks.

The  results  show  that  the  sampling  time  increased  linearly  with  the  size  of  the  given 
data  set,  and  it  took  significantly  longer  (in  the  largest  data  set)  than  the  time  to  construct 
the  mass  distribution — which  was  constant,  regardless  of  the  given  data  size.  Note  that  the 
training  time  provided  in  Table  9  includes  both  the  sampling  time  and  mass  estimation  time, 
and  it  is  dominated  by  the  sampling  time  for  large  data  sets. 

The memory required for each construction of mass(x, h|D) is to store one lookup table
of size ψ, which is constant.

The constant time and space complexities apply to multi-dimensional mass estimation
too.

Table 11 Runtime (second) for sampling, mass(x, 1|D) and mass(x, 3|D), where t = 1000 and ψ = 8

Data set    Data size   Sampling   mass(x, 1|D)   mass(x, 3|D)
Http        567497      138.30     0.33           10.96
Shuttle     49097       16.16      0.39           10.97
COREL       10000       1.23       0.27           11.03
tic         9822        1.09       0.43           11.14
concrete    1030        0.18       0.31           10.95

Fig. 9 Runtime comparison: One-dimensional mass estimation versus multi-dimensional mass
estimation for different values of h in the COREL data set, where both are using ψ = 8 and
t = 1000. In this experiment, we set h to the required value for multi-dimensional mass
estimation, rather than h = ψ which was used in all experiments reported in the previous
sections


6.5  Runtime  comparison  between  one-dimensional  and  multi-dimensional  mass 
estimations 

In terms of runtime, the comparison so far in the experiments might give the impression that
multi-dimensional mass estimation is worse than one-dimensional mass estimation. In fact,
the opposite is true, because the above results are obtained with h = 1 for one-dimensional
mass estimation and h = ψ for multi-dimensional mass estimation. Figure 9 shows the
head-to-head comparison for different h values in the COREL data set. When h increases from
1 to 5, the runtime for the one-dimensional mass estimation increases by a factor of 33. In
contrast, the runtime for the multi-dimensional mass estimation increases by a factor of 1.5
only.

6.6  Summary 

The above results in all three tasks show that the orderings provided by mass distributions
deliver additional information about the data that would otherwise be hidden in the original
features. The additional information, which accentuates fringe points with a concave
function (or an approximation to a concave function in the case of multi-dimensional mass
estimation), improves the task-specific performance significantly, especially in the information
retrieval and regression tasks.

Using Algorithm 5 for the information retrieval and regression tasks, the runtime is
expected to be higher because the new space has much higher dimensions than the original
space (t ≫ u). It should be noted that the runtime increase (linear or worse) is solely a
characteristic of the existing algorithms used, and is not due to the mass space mapping, which
has constant time and space complexities.

Table 12 A comparison of kernel density estimation and mass estimation. Kernel density estimation requires
two parameter settings: kernel function K(·) and bandwidth h_w; mass estimation has one: h

Kernel density estimation:

    \mathrm{density}(x) = \frac{1}{n h_w} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h_w}\right)

Mass estimation:

    \mathrm{mass}(x, h) =
    \begin{cases}
      \sum_{i=1}^{n-1} \mathrm{mass}_i(x, h-1)\, p(s_i), & h > 1 \\
      \sum_{i=1}^{n-1} m_i(x)\, p(s_i), & h = 1
    \end{cases}


We believe that a more tailored approach that better integrates the information provided
by mass (into the C3 component in the formalism) for a specific task can potentially further
improve the current level of performance in terms of either task-specific performance
measure or runtime. We have demonstrated this ‘direct’ application using Algorithm 6 for
the anomaly detection task, in which MassAD performs equally well or significantly better
than four state-of-the-art methods in terms of task-specific performance measure, and the
one-dimensional mass estimation executes faster than all other methods in terms of runtime.

Why does one-dimensional mapping work when tackling multi-dimensional problems?
We conjecture that if there is no or little interaction between features, then the
one-dimensional mapping will work because the ordering that accentuates the fringe points
for each original dimension makes it easy for existing algorithms to exploit. When there
are strong interactions between features, one-dimensional mapping might not achieve
good results. Indeed, our results in all three tasks show that multi-dimensional mass
estimation does perform better than one-dimensional mass estimation in general, in terms of
task-specific performance measures.

The ensemble method for mass estimation usually needs only a small sample to build
each model in an ensemble. In addition, in order to build all t models for an ensemble, tψ
could be more than n when ψ > n/t.

The key limitation of the one-dimensional mass estimation is its high cost when a high
value of h is applied. This can be avoided by implementing it using a tree structure rather
than a lookup table, as we have done using Half-Space Trees, which reduces the time
complexity to O(th(ψ + n)) from O(t(ψ^(h+1) + n)).


7  Relation  to  kernel  density  estimation 

A comparison of mass estimation and kernel density estimation is provided in Table 12.
Like kernel estimation, mass estimation at each point is computed through a summation
of a series of values from a mass base function equivalent to a kernel function K(·).
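
The summation for h = 1 can be sketched in a few lines. The definitions of m_i(x) and p(s_i) are not restated in this section, so the sketch makes explicit assumptions: the split s_i lies midway between the i-th and (i+1)-th sorted points, p(s_i) is proportional to the gap between those two points, and m_i(x) counts the points on the same side of s_i as x. The function name is illustrative only.

```python
def mass_level1(x, data):
    """One-dimensional mass of x for h = 1, under the stated assumptions.

    mass(x) = sum over the n-1 splits s_i of m_i(x) * p(s_i), where
    p(s_i) is the normalized gap between adjacent sorted points and
    m_i(x) is the number of points on the side of s_i that contains x.
    """
    pts = sorted(data)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points")
    span = pts[-1] - pts[0]
    if span == 0:
        raise ValueError("all points are identical")
    total = 0.0
    for i in range(1, n):                 # split s_i between pts[i-1] and pts[i]
        left, right = pts[i - 1], pts[i]
        split = 0.5 * (left + right)
        p = (right - left) / span         # assumed split probability
        m = i if x < split else n - i     # points on the side containing x
        total += m * p
    return total
```

For the five points 1, 2, 3, 4, 5 this gives a mass of 3.5 at the centre point 3 and 2.5 at either end point, which illustrates the ordering from core points to fringe points that mass estimation is intended to capture.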

The  two  methods  differ  in  the  following  ways: 

•  Aim: Kernel estimation aims at probability density estimation, whereas mass estimation
aims to estimate an ordering from the core points to the fringe points.

•  Kernel function: While kernel estimation can use different kernel functions for probability
density estimation, we doubt that mass estimation requires a different base function, for
two reasons. First, a more sophisticated function is unlikely to provide a better ordering
than a simple rectangular function. Second, the rectangular function keeps the computation
simple and fast. In addition, a kernel function must be fixed (i.e., have user-defined
values for its parameters); e.g., the rectangular kernel function has a fixed width or a fixed
per-unit size. In contrast, the rectangular function used in mass has no parameter and no
fixed width.

•  Sample size: Kernel estimation or other density estimation methods require a large sample
size in order to estimate the probability accurately (Duda et al. 2001). Mass estimation
using mass(x, h|D) needs only a small sample size in an ensemble to accurately estimate
the ordering.

Table 13 CBIR results (in BEP × 10^-2)

(a) Compare with Qsim^K (using kernel density estimation), Qsim^D (using data depth), Qsim^LD (using
local data depth)

             Qsim"   Qsim'   Qsim^K   Qsim^D   Qsim^LD   Qsim
One query    12.38   10.35   2.90     10.39    7.60      7.78
Round 1      19.18   15.46   3.01     15.02    10.95     10.59
Round 2      21.98   17.58   2.74     17.16    12.50     11.81
Round 3      23.67   18.71   2.54     18.37    13.42     12.59
Round 4      24.65   19.50   2.42     19.20    14.03     13.16
Round 5      25.42   19.96   2.34     19.74    14.36     13.55

(b) Compare with InstR^K, InstR^D and InstR^LD

             InstR"  InstR'  InstR^K  InstR^D  InstR^LD  InstR
One query    12.38   10.35   2.90     10.39    7.60      7.78
Round 1      13.88   13.33   2.91     13.05    8.71      9.40
Round 2      15.12   14.95   2.55     14.73    9.68      9.99
Round 3      16.19   16.07   2.25     15.98    10.28     10.36
Round 4      16.88   16.93   2.06     16.82    10.78     10.78
Round 5      17.49   17.58   1.99     17.50    11.17     11.05

Here  we  present  the  results  using  a  Gaussian  kernel  density  estimation,  replacing  the 
one-dimensional  mass  estimation,  using  the  same  subsample  size  in  an  ensemble  approach. 
The  bandwidth  parameter  is  set  to  be  the  standard  deviation  of  the  subsample;  and  all  the 
other  parameters  are  the  same. 
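
The density-based replacement can be sketched for a single attribute as follows. This is an illustrative reading of the setup above, not the code used in the experiments: each of the t models is a Gaussian kernel density estimate built from a random subsample of size ψ, the bandwidth is the standard deviation of that subsample, and the model outputs are averaged. The function names, the default parameter values and the use of the sample standard deviation are assumptions.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density at x from a 1-D sample."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def ensemble_density(data, t=100, psi=8, seed=0):
    """Average of t Gaussian KDE models, each built from psi sampled points."""
    rng = random.Random(seed)
    models = []
    for _ in range(t):
        sub = rng.sample(data, psi)
        # Bandwidth = standard deviation of the subsample; the small
        # fallback value guards against a degenerate subsample.
        bw = statistics.stdev(sub) or 1e-12
        models.append(gaussian_kde(sub, bw))

    def density(x):
        return sum(m(x) for m in models) / len(models)

    return density
```

Swapping this estimator in for the one-dimensional mass estimation, with everything else unchanged, corresponds to the Qsim and InstR density variants and DensityAD discussed next.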

The results for information retrieval and anomaly detection are provided in Tables 13
and 15. Compared to mass, density performed significantly worse in the information retrieval
tasks in all experiments using Qsim and InstR, denoted as Qsim^K and InstR^K,
respectively. They were even worse than those run in the original space. In anomaly detection,
DensityAD, which used Gaussian kernel density estimation, performed significantly
worse than MassAD in six out of eight data sets, and better in the other two data sets.


8  Relation  to  data  depth 

There  is  a  close  relationship  between  the  proposed  mass  and  data  depth  (Liu  et  al.  1999): 
they  both  delineate  the  centrality  of  a  data  cloud  (as  opposed  to  compactness  in  the  case  of 
the  density  measure).  The  properties  common  to  both  measures  are:  (a)  the  centre  of  a  data 
cloud  has  the  maximum  value  of  the  measure;  (b)  an  ordering  from  the  centre  (having  the 
maximum  value)  to  the  fringe  points  (having  the  minimum  values). 

However, there are two key differences. First, until recently (see Agostinelli and
Romanazzi 2011), data depth always modelled a given data set with one centre, regardless of
whether the data is unimodal or multi-modal; whereas mass can model both unimodal and
multi-modal data by setting h = 1 or h > 1. Local data depth (Agostinelli and Romanazzi 2011)
has a parameter (r) which allows it to model multi-modal data as well as unimodal data.
However, the performance of local data depth appears to be sensitive to the setting of r (see
the discussion of the comparison below). In contrast, a single setting of h in mass estimation
produced good task-specific performance in three different tasks in our experiments.

Table 14 CBIR results (online time cost in seconds)

(a) Compare with Qsim^K, Qsim^D, Qsim^LD

             Qsim"   Qsim'   Qsim^K   Qsim^D   Qsim^LD   Qsim
One query    0.715   0.822   0.820    0.840    0.829     0.093
Round 1      0.207   0.208   0.224    0.237    0.226     0.035
Round 2      0.228   0.231   0.279    0.288    0.276     0.058
Round 3      0.257   0.259   0.348    0.355    0.343     0.086
Round 4      0.291   0.294   0.435    0.438    0.425     0.122
Round 5      0.335   0.341   0.547    0.543    0.531     0.167

(b) Compare with InstR^K, InstR^D and InstR^LD

             InstR"  InstR'  InstR^K  InstR^D  InstR^LD  InstR
One query    0.715   0.822   0.820    0.840    0.829     0.093
Round 1      0.197   0.198   0.203    0.215    0.206     0.026
Round 2      0.200   0.200   0.205    0.216    0.206     0.028
Round 3      0.200   0.200   0.206    0.217    0.207     0.028
Round 4      0.200   0.200   0.207    0.218    0.208     0.028
Round 5      0.200   0.200   0.207    0.218    0.208     0.028

Table 15 Anomaly detection: MassAD vs DensityAD and DepthAD (AUC)

               MassAD                       DepthAD
Data set       Mass"   Mass'   DensityAD    Depth   LDepth
Http           1.00    1.00    0.99         0.98    0.52
Forest         0.90    0.92    0.70         0.85    0.49
Mulcross       0.26    0.99    1.00         0.99    0.93
Smtp           0.91    0.86    0.59         0.92    0.93
Shuttle        1.00    0.99    0.90         0.87    0.72
Mammography    0.86    0.37    0.27         0.36    0.79
Annthyroid     0.75    0.71    0.80         0.58    0.86
Satellite      0.77    0.62    0.61         0.59    0.69

Second, mass is a simple and straightforward measure, and has efficient estimation methods
based on axis-parallel partitions only. Data depth has many different definitions, depending
on the construct used to define depth. The constructs could be Mahalanobis, convex
hull, simplicial, halfspace and so on (Liu et al. 1999), all of which are expensive to compute
(Aloupis 2006); this has been the main obstacle in applying data depth to real applications
in multi-dimensional problems. For example, Ruts and Rousseeuw (1996) compute the
contour of data depth of a data cloud for visualization, and employ depth as the anomaly score to
identify anomalies. Because of its computational cost, it is limited to small data sizes only. In
contrast to the axis-parallel partitions used in mass estimation, halfspace data depth (Tukey
1975; see footnote 5), for example, requires considering all halfspaces, which demands high
computational time and space.
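
For intuition only, halfspace depth is easy to evaluate in one dimension, where the closed halfspaces containing x are the rays (-∞, a] with a ≥ x and [a, ∞) with a ≤ x. With respect to the empirical distribution of a sample, the minimum probability mass over these rays is the smaller of the fraction of points at or below x and the fraction at or above x. The sketch below covers this one-dimensional special case, not the general multi-dimensional computation whose cost is discussed above; the function name is illustrative.

```python
def halfspace_depth_1d(x, data):
    """Empirical halfspace (Tukey) depth of x in one dimension.

    Equals min(P(X <= x), P(X >= x)) under the empirical distribution
    of `data`; points outside the data range have depth 0.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    at_or_below = sum(1 for v in data if v <= x)
    at_or_above = sum(1 for v in data if v >= x)
    return min(at_or_below, at_or_above) / n
```

As with mass, the value is largest at the centre of the data and decreases towards the fringe; unlike mass with h > 1, it always yields a single centre.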

To provide a comparison, we replace the one-dimensional mass estimation (defined
in Algorithm 3) with data depth (defined by simplicial depth, Liu et al. 1999) and local
data depth (defined by simplicial local depth, Agostinelli and Romanazzi 2011). We repeat
the experiments by employing both the data depth and local data depth implementation
in R by Agostinelli and Romanazzi (2011) (accessible from r-forge.r-project.org/projects/
localdepth). Both data depths are carried out in the same approach by using sample size
ψ to build each of the t models in an ensemble (see footnote 6). The number of simplices used
to do the empirical estimation is set to 10000 for all runs. Default settings are used for all other
parameters (i.e., the membership of a data point in simplices is evaluated in the “exact” mode
rather than the approximate mode, and the tolerance parameter is fixed to 10^-9). Note that
local depth uses an additional parameter r to select candidate simplices, where a simplex
having volume larger than r is excluded from consideration. As the performance of local
depth is sensitive to r, we set r by its quantile order to 10 %, the low end of the range
10 %-30 % suggested by Agostinelli and Romanazzi (2011). Because both data depth and
local data depth are estimated using the same procedure, their runtimes are the same.

The  task-specific  performance  result  for  information  retrieval  is  provided  in  Table  13. 
Note  that  local  data  depth  could  produce  worse  retrieval  results  than  those  in  the  original 
feature  space.  Data  depth  performed  close  to  that  achieved  by  the  one-dimensional  mass 
estimation,  but  it  was  significantly  worse  than  the  multi-dimensional  mass  estimation. 

Figure 10 shows a scale up test in the information retrieval task using Qsim with one
query and feedback round 5. It is interesting to note that both mass and data depth performed
better using a small rather than a large subsampling size. As expected, KDE produced better
results with increasing subsampling sizes; but even with ψ = 8196 in the COREL data set
of 10000 instances, KDE still performed the worst compared to mass and data depth.

Table  15  shows  the  result  in  anomaly  detection.  Data  depth  performed  worse  than  both 
versions  of  mass  estimation  in  six  out  of  eight  data  sets;  local  data  depth  performed  worse 
than  multi-dimensional  mass  estimation  in  five  out  of  eight  data  sets;  local  data  depth  versus 
one-dimensional  mass  estimation  have  four  wins  and  four  losses.  Note  that  though  local  data 
depth  achieved  the  best  result  in  two  data  sets,  it  also  produced  the  worst  in  three  data  sets 
which  were  significantly  worse  than  others  (in  http,  forest  and  shuttle). 

The runtime results are provided in Tables 14 and 16. These results do not reveal the time
complexities of the algorithms because of the small ψ (and the CBIR results do not include the
offline time cost). We conducted a scale up test using the Mulcross data set by increasing
the subsampling size. Using the runtime at ψ = 8 as the base, the runtime ratio is computed
for all other subsampling sizes. The result is presented in Fig. 11. It shows that data depth
or local data depth had the worst runtime ratio: its runtime increased 58 times when
ψ was increased by a factor of 512. The multi-dimensional mass estimation had the best
runtime ratio of 6.6, followed by KDE (24) and one-dimensional mass estimation (34), when the
ψ ratio = 512. The actual runtimes in seconds were 126.6 (Mass"), 166.7 (KDE), 239.4
(Mass'), and 600.5 (Data Depth). This result is not surprising because the multi-dimensional
mass estimation has time complexity O(ψ), KDE has O(ψ^2), the one-dimensional mass
estimation has O(ψ^(h+1)), and data depth using simplices has O(ψ^4) (Aloupis 2006).

Fig. 10 Scale up test in information retrieval using Qsim. Subsampling data size is increased from ψ = 8
to ψ = 8196 in the COREL image data set containing 10000 instances. The same experimental setting as
reported in Sect. 6.1 is used

Table 16 Anomaly detection: MassAD vs DensityAD and DepthAD (time in seconds)

               MassAD                       DepthAD
Data set       Mass"   Mass'   DensityAD    Depth   LDepth
Http           168     18      17           38      38
Forest         63      10      10           31      31
Mulcross       52      10      10           31      31
Smtp           27      10      10           26      26
Shuttle        20      4       4            25      25
Mammography    21      3       3            24      24
Annthyroid     7       1       1            23      23
Satellite      13      1       1            23      23

5 Zuo and Serfling (2000) define halfspace data depth (HD) of a point x in R^u w.r.t. a probability measure P
on R^u as the minimum probability mass carried by any closed halfspace containing x:

    HD(x; P) = \inf\{ P(H) : H \text{ a closed halfspace},\ x \in H \}, \quad x \in \mathbb{R}^u

In the language of data depth, the one-dimensional mass estimation may be interpreted as a kind of average
probability mass of halfspaces containing x, weighted by mass covered by halfspace. But the one-dimensional
mass estimation defined in (1) allows mass to be computed by a summation of n − 1 components from the
given data set of size n, whereas data depth does not. In addition, our implementation of multi-dimensional
mass estimation using a tree structure with axis-parallel splits cannot be interpreted using any of the constructs
employed by data depth.

6 Our experiments indicate that using the entire data set to estimate data depth or local data depth produces
worse results than those using an ensemble approach. This result is shown in the Appendix.


9  Other  work  based  on  mass 

iForest (Liu et al. 2008) and MassAD share some common features: both are ensemble methods which build $t$ models, each from a random sample of size $\psi$, and they both combine the outputs of the models through averaging during testing. Although iForest (Liu et al. 2008) employs path length (the length of the path an instance traverses from the root of a tree to its leaf) as the anomaly score, we have shown that the path length used in iForest is in fact a proxy to mass (see Sect. 4.1 for details). In other words, iForest is a kind of mass-based method, which is why MassAD and iForest have similar detection accuracy. Multi-dimensional MassAD has the closest resemblance to iForest because of the use of trees. The key difference is that MassAD is just one application of the more fundamental concept of mass introduced here, whereas iForest is for anomaly detection only. In terms of implementation, the key difference is how the cut-off value is selected at each internal node of a tree: iForest selects the cut-off value randomly whereas a Half-Space Tree selects a mid point deterministically (see step 5 in Algorithm 2).

Fig. 11 Scale up test using the Mulcross data set. The base subsampling data size ($\psi$) is 8, doubling at each step until $\psi = 4096$. Each point in the graph is an average over 10 runs (x-axis: subsample size ratio)
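The difference in cut-off selection can be illustrated with a minimal sketch. The function names `random_cut` and `midpoint_cut` are hypothetical and are not taken from either implementation; the sketch only contrasts a uniformly random cut-off with a deterministic mid point on one attribute range.

```python
import random


def random_cut(lo: float, hi: float, rng: random.Random) -> float:
    """iForest-style cut-off: a value drawn uniformly at random from [lo, hi]."""
    return rng.uniform(lo, hi)


def midpoint_cut(lo: float, hi: float) -> float:
    """Half-Space-Tree-style cut-off: the deterministic mid point of [lo, hi]."""
    return (lo + hi) / 2.0


rng = random.Random(0)
lo, hi = 2.0, 10.0
r = random_cut(lo, hi, rng)   # differs from run to run unless the seed is fixed
m = midpoint_cut(lo, hi)      # always the same for a given range
```

With a mid point, the two half-spaces always have equal volume, which is what lets a region's volume be derived from its depth alone.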

How easily can the proposed formalism be applied to other tasks? In addition to the tasks we have addressed in this paper, we have applied mass estimation ‘directly’, using the proposed formalism, to solve problems in content-based multimedia information retrieval (Zhou et al. 2012) and clustering (Ting and Wells 2010). While the ‘indirect’ application is straightforward, simply using existing algorithms in the mass space, a ‘direct’ application requires a complete rethink of the problem and produces a totally different algorithm. However, this rethink of a problem in terms of mass often results in a more efficient and sometimes more effective algorithm than existing algorithms. We provide a brief description of the two applications in the following two paragraphs.

In addition to the mass-space mapping we have shown here (i.e., components C1 and C2), Zhou et al. (2012) present a content-based information retrieval method that assigns a weight (based on iForest, thus, mass) to each new mapped feature w.r.t. a query; it then ranks objects in the database according to their weighted average feature values in the mapped space. The method also incorporates relevance feedback which modifies the ranking based on the feedback through reweighted features in the mapped space. This method forms the
third  component  of  the  formalism  stated  in  Sect.  5.  This  ‘direct’  application  of  mass  has  been 
shown  to  be  significantly  better  than  the  ‘indirect’  approach  we  have  shown  in  Sect.  6.1,  in 
terms  of  both  task-specific  measure  and  runtime  (Zhou  et  al.  2012).  It  is  interesting  to  note 
that,  unlike  existing  retrieval  systems  which  rely  on  a  metric,  the  new  mass-based  method 
does  not  employ  a  metric — it  is  the  first  information  retrieval  system  that  does  not  use  a 
metric,  as  far  as  we  know. 

Ting and Wells (2010) use a variant of the Half-Space Trees we have employed here and apply mass directly to solve clustering problems. It is the first mass-based clustering algorithm, and it is unique because it does not use any distance or density measure. In this task, as in the case of anomaly detection, only two components are required. After building a mass model (in the C1 component), the C3 component consists of linking instances with non-zero mass connected by the mass model and making each group of connected instances a separate cluster; all other unconnected instances are regarded as noise. This mass-based clustering algorithm has been shown to perform as well as DBSCAN (Ester et al. 1996) in terms of clustering performance, but it runs orders of magnitude faster (Ting and Wells 2010).

The earlier version of this paper (Ting et al. 2010) establishes the properties of mass estimation in the one-dimensional setting only, and uses it in all three tasks. This paper extends one-dimensional mass estimation to multi-dimensional mass estimation using the same approach as described by Ting and Wells (2010), and implements multi-dimensional mass estimation using Half-Space Trees. This paper reports new experiments using multi-dimensional mass estimation, and shows the advantage of using multi-dimensional mass estimation over one-dimensional mass estimation in the three tasks reported earlier (Ting et al. 2010). These related works show that mass estimation can be implemented in different ways using tree-based or non-tree-based methods.


10  Conclusions  and  future  work 

This  paper  makes  two  key  contributions.  First,  we  introduce  a  base  measure,  mass,  and 
delineate  its  three  properties:  (i)  a  mass  distribution  stipulates  an  ordering  from  core  points 
to  fringe  points  in  a  data  cloud;  (ii)  this  ordering  accentuates  the  fringe  points  with  a  concave 
function — a  property  that  can  be  easily  exploited  by  existing  algorithms  to  improve  their 
task-specific  performance;  and  (iii)  the  mass  estimation  methods  have  constant  time  and 
space  complexities.  Density  estimation  has  been  the  base  modelling  mechanism  employed 
in  many  techniques  thus  far.  Mass  estimation  introduced  here  provides  an  alternative  choice, 
and  it  is  better  suited  for  many  tasks  which  require  an  ordering  rather  than  probability  density 
estimation. 

Second, we present a mass-based formalism which forms a basis to apply mass to different tasks. The three tasks (i.e., information retrieval, regression and anomaly detection) to which we have successfully applied it are just examples of its application. Mass estimation has potential in many other applications.

There are potential extensions to the current work. First, the algorithms for the three tasks and the formalism can be improved or extended to include more tasks. Second, because their purposes and properties differ, mass estimation is not intended to replace density estimation; it is thus important to identify the areas for which each is best suited. This will ascertain (i) areas in which density has been a mismatch, unrecognised until now, and (ii) areas in which mass estimation is weak. Third, the proposed approach to multi-dimensional mass estimation is an approximation and it does not guarantee concavity. It will be interesting to explore a version that has such a guarantee and to examine whether it will further improve the task-specific performance in all three tasks reported here. Fourth, the current implementation of multi-dimensional mass estimation using Half-Space Trees limits its applications to low dimensional problems because it suffers from the same problem as all other grid oriented methods. We will explore non-grid oriented implementations of mass which have the potential to tackle high dimensional problems more effectively than existing density-based and distance-based methods.


The  Matlab  source  codes  of  both  one-dimensional  and  multi-dimensional  mass 
estimations  are  available  at  http://sourceforge.net/projects/mass-estimation/. 
Appendix: Anomaly detection using data depth that builds a single model from the entire data set


This appendix provides the results in the anomaly detection task where data depth and local data depth built a single model from the entire data set, i.e., DepthAD^. This is in contrast to DepthAD which employed an ensemble approach in Sect. 8.

Table 17 shows that MassAD generally has higher AUC than DepthAD^ which employed either data depth or local data depth. The only exception is the Annthyroid data set. Note that these results are generally worse than those employing an ensemble approach, reported in Table 15.


Table 17 AUC values for anomaly detection, comparing MassAD with DepthAD^ (which employed either data depth or local data depth) that builds a single model from the entire data set

               MassAD             DepthAD^
               Mass"     Mass'    Depth     LDepth
Http           1.00      1.00     0.84      0.50
Forest         0.90      0.92     0.50      0.55
Mulcross       0.26      0.99     0.88      0.61
Smtp           0.91      0.86     0.86      0.76
Shuttle        1.00      0.99     0.51      0.70
Mammography    0.86      0.37     0.73      0.62
Annthyroid     0.75      0.71     0.59      0.85
Satellite      0.77      0.62     0.50      0.70
Abstract Density estimation is the ubiquitous base modelling mechanism employed for many tasks including clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods such as kernel density estimator and k-nearest neighbour density estimator have high time and space complexities which render them inapplicable in problems with big data. This weakness sets the fundamental limit in existing algorithms for all these tasks. We propose the first density estimation method, having average case sub-linear time complexity and constant space complexity in the number of instances, that stretches this fundamental limit to an extent that dealing with millions of data can now be done easily and quickly. We provide an asymptotic analysis of the new density estimator and verify the generality of the method by replacing existing density estimators with the new one in three current density-based algorithms, namely DBSCAN, LOF and Bayesian classifiers, representing three different data mining tasks of clustering, anomaly detection and classification. Our empirical evaluation results show that the new density estimation method significantly improves their time and space complexities, while maintaining or improving their task-specific performances in clustering, anomaly detection and classification. The new method empowers these algorithms, currently limited to small data size only, to process big data, setting a new benchmark for what density-based algorithms can achieve.

The source codes of DEMass-DBSCAN and DEMass-Bayes are available at http://sourceforge.net/projects/mass-estimation/.

Keywords  Density  estimation  •  Density-based  algorithms 


1  Introduction 

Density estimation is ubiquitously applied to various tasks such as clustering, classification, anomaly detection and information retrieval. Despite its pervasive use (‘estimation of densities is a universal problem of statistics’ [36]), there are no efficient density estimation methods thus far. Most existing methods such as kernel density estimator (KDE) and k-nearest neighbour (k-NN) density estimator cannot be applied to problems with big data. This paper is motivated to introduce the first efficient density estimation method for big data. We show that two existing density-based algorithms for clustering and anomaly detection, when employing the new density estimator, set a new runtime benchmark that is orders of magnitude faster. For example, the clustering algorithm now takes only hours instead of more than one month to complete a task involving one million instances, after the existing density estimator is replaced with the new one.

We  make  five  contributions  in  this  paper: 

1. Propose a new density estimation method which has a significant advantage over existing methods in terms of time and space complexities.

2. Establish the asymptotic behaviour of the method through a bias-variance analysis.

3. Verify the generality of the method by replacing existing density estimators with the new one in three current density-based algorithms.

4. Significantly simplify and speed up the two current algorithms in anomaly detection and clustering tasks using set-based definitions instead of the common point-based definitions.

5. Introduce the first Bayesian classifier with constant training time complexity in the number of instances. The proposed Bayesian classifier estimates the multidimensional likelihood directly from the training data, unlike most existing Bayesian classifiers which estimate single-dimensional likelihood.

The  new  density  estimation  method  distinguishes  itself  from  existing  methods  by: 

•  Employing  no  distance  measures  in  the  density  estimation  process. 

•  Having  constant  space  complexity  and  average  case  sub-linear  time  complexity  in  the 
number  of  instances  in  unsupervised  learning  tasks  and  constant  training  time  complexity 
in  the  number  of  training  instances  in  supervised  learning  tasks.  Thus,  it  can  be  applied 
to  big  data  in  which  current  methods  such  as  kernel  and  k-NN  density  estimators  are 
infeasible  because  they  are  expensive  to  compute. 

Three existing density estimators are presented in Sect. 2, in order to contrast with the new density estimator we introduce in Sect. 3. We analyse the error produced by the new estimator by a bias-variance analysis and provide a comparison of the estimation results between the new estimator and KDE in Sect. 4. Sections 5 and 6 describe how the new estimator can replace the existing density estimators in three current state-of-the-art density-based algorithms and their empirical evaluation results, respectively. A discussion of the related issues and the conclusions are provided in the last two sections.


2  Density  estimation 

This section describes probably the three most commonly used density estimation methods, namely KDE, k-nearest neighbour density estimator and ε-neighbourhood density estimator.

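2.1 Kernel density estimator

Let $\mathbf{x}$ be an instance in a $d$-dimensional space $\mathbb{R}^d$. The KDE defined by a kernel function $K(\cdot)$ and bandwidth $b$ is given as follows [29]:

$$\bar{f}_{KDE}(\mathbf{x}) = \frac{1}{n b^d} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{b}\right)$$

The difference $\mathbf{x} - \mathbf{x}_i$ requires some form of distance measure, and $n$ is the number of instances in the given data set $D$. An example of $K(\cdot)$, as a rectangular function, is given as follows:

$$K(x) = \begin{cases} \frac{1}{2} & \text{if } |x| < 1 \\ 0 & \text{otherwise.} \end{cases}$$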

2.2 k-NN density estimator

A k-NN density estimator can be expressed as follows [30]:

$$\bar{f}_{kNN}(\mathbf{x}) = \frac{|N(\mathbf{x}, k)|}{n \sum_{\mathbf{x}' \in N(\mathbf{x}, k)} distance(\mathbf{x}, \mathbf{x}')}$$

where $N(\mathbf{x}, k)$ is the set of $k$ nearest neighbours to $\mathbf{x}$, and the search for nearest neighbours is conducted over $D$ of size $n$.

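2.3 ε-Neighbourhood density estimator

The ε-neighbourhood density estimator [12] is defined as follows:

$$\bar{f}_{\epsilon}(\mathbf{x}) = \frac{|N_{\epsilon}(\mathbf{x})|}{n \epsilon}$$

where $N_{\epsilon}(\mathbf{x}) = \{\mathbf{x}' \in D \mid distance(\mathbf{x}, \mathbf{x}') < \epsilon\}$.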

All three density estimators have $O(n^2)$ time complexity and $O(n)$ space complexity in order to estimate the densities of $n$ instances. Although there are various indexing schemes to speed up the search for nearest neighbours in order to aid the k-NN and ε-neighbourhood density estimators, they are not satisfactory in terms of dealing with high-dimensional problems and large data sets. We will provide further discussion of this issue in Sect. 7.
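The two neighbourhood-based estimators can be sketched in one dimension as follows. This is a minimal illustration without any indexing scheme, assuming absolute difference as the distance; the function names and toy data are hypothetical.

```python
def knn_density(x: float, data: list, k: int) -> float:
    """|N(x, k)| / (n * sum of distances to the k nearest neighbours).

    Assumes the query is not a duplicate of all its k neighbours,
    otherwise the distance sum is zero.
    """
    dists = sorted(abs(x - xi) for xi in data)[:k]    # scans all n instances
    return len(dists) / (len(data) * sum(dists))


def eps_density(x: float, data: list, eps: float) -> float:
    """|N_eps(x)| / (n * eps), counting instances closer than eps."""
    count = sum(1 for xi in data if abs(x - xi) < eps)  # scans all n instances
    return count / (len(data) * eps)


data = [0.0, 1.0, 2.0, 10.0]
knn_near, knn_far = knn_density(0.5, data, k=2), knn_density(9.0, data, k=2)
eps_near, eps_far = eps_density(1.0, data, eps=1.5), eps_density(10.0, data, eps=1.5)
```

Because each query scans the whole data set, estimating all $n$ densities requires on the order of $n^2$ distance computations, which is the cost the complexity statement above refers to.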
3 Density estimator based on mass

A recently introduced base measure called mass [35] has been widely applied to solve various data mining tasks such as regression, information retrieval, clustering and anomaly detection, including one in data stream [31, 34, 35].

Because mass is more fundamental than density, we show in this paper that a density estimator can be constructed from mass. The key advantage of mass is that it can be computed very quickly. The new density estimator based on mass (DEMass) inherits this advantage and executes significantly faster than existing density estimators such as KDE, k-NN and ε-neighbourhood. It raises the capability of density-based algorithms to handle big data to a new high level.

A mass base function is defined as follows by [34]:

$$m(T(\mathbf{x})) = \begin{cases} m & \text{if } \mathbf{x} \text{ is in a region of } T(\cdot), \\ 0 & \text{otherwise,} \end{cases}$$

where $T(\cdot)$ is a function which subdivides the feature space into non-overlapping regions based on the given data set $D$, and $m$ is the number of samples in the region of $T(\mathbf{x})$ into which $\mathbf{x}$ falls.

Ting and Wells [34] show that mass can also be effectively estimated using data subsets $\mathcal{D}_i \subset D$ $(i = 1, \ldots, t)$ and their associated $T_i(\mathbf{x}|\mathcal{D}_i)$, where $|\mathcal{D}_i| = \psi \ll n$. Each $\mathcal{D}_i$ is sampled without replacement from $D$. The mass estimated using sub-samples is defined as follows:

$$\bar{m}(T(\mathbf{x})) = \frac{1}{t} \sum_{i=1}^{t} m(T_i(\mathbf{x}|\mathcal{D}_i))$$

We now introduce the new density estimators based on mass (DEMass) and describe their implementation in the next two subsections.

3.1  DEMass 

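Once mass is estimated, density can be estimated as a ratio of mass and volume. Thus, the new density estimators based on mass functions $m(T(\mathbf{x}))$ and $m(T_i(\mathbf{x}|\mathcal{D}_i))$ are defined, respectively, as:

$$f_m(\mathbf{x}) = \frac{m(T(\mathbf{x}))}{n v} \qquad (1)$$

$$\bar{f}_m(\mathbf{x}) = \frac{1}{t} \sum_{i=1}^{t} \frac{m(T_i(\mathbf{x}|\mathcal{D}_i))}{\psi v_i} \qquad (2)$$

where $v$ and $v_i$ are the volumes of regions $T(\mathbf{x})$ and $T_i(\mathbf{x}|\mathcal{D}_i)$, respectively.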
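The ratio of mass and volume can be made concrete with a short numerical sketch. It assumes the sub-sample estimator averages, over $t$ regions, the mass of the region containing the query divided by $\psi$ times the region's volume; the function name and the numbers are hypothetical.

```python
def demass_density(masses: list, volumes: list, psi: int) -> float:
    """Average of mass / (psi * volume) over the t regions covering the query."""
    t = len(masses)
    return sum(m / (psi * v) for m, v in zip(masses, volumes)) / t


# Two hypothetical trees built from sub-samples of size psi = 8:
# the query falls in a region holding 4 instances with volume 0.5 in the first,
# and in a region holding 2 instances with volume 0.25 in the second.
estimate = demass_density(masses=[4, 2], volumes=[0.5, 0.25], psi=8)
```

Both regions hold the same fraction of the sub-sample per unit volume, so they agree on the estimate. No distance between instances is computed at any point.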

We use the term DEMass to refer to the new density estimator constructed from $T_i(\mathbf{x}|\mathcal{D}_i)$ in the rest of this paper. DEMass has two key differences/advantages when compared to the one based on a kernel method, k-NN or ε-neighbourhood:

• $\bar{f}_m$ is estimated from $t\psi$ instances only, which are significantly fewer than $D$ in a large data set. It sums over $t$ randomly generated regions, whereas $\bar{f}_{KDE}$ sums over $n$ instances in $D$, and $\bar{f}_{kNN}$ and $\bar{f}_{\epsilon}$ require a search on the entire data set. For a large data set, $\bar{f}$ is prohibitively expensive to compute in these three methods.1

• $\bar{f}_m$ needs no distance measures.

1 While there are ways to reduce the computational cost of KDE, k-NN and ε-neighbourhood, they are usually limited to low-dimensional problems or incur significant preprocessing cost. See Sect. 7 for a discussion.

3.2  Implementation 

Mass estimation can be implemented in different ways [31, 34, 35]. We use the implementation of $T(\cdot|\mathcal{D})$ as described by [34] using a binary tree (called h:d-Trees by [34]) as the basis to build the density estimator $\bar{f}_m$. Algorithm 1 generates $t$ trees from a given data set $D$. Algorithm 2 generates a single tree using a subset $\mathcal{D} \subset D$, where $|\mathcal{D}| = \psi$.

When $T(\cdot|\mathcal{D})$ is implemented as a binary tree, the volumes of regions in $T(\cdot|\mathcal{D})$ are controlled by a parameter $h$ which defines the level of binary subdivision. Let $\Delta_i$ be a workspace in $\mathbb{R}^d$ which envelops $\mathcal{D}_i$, and $\Delta_i$ has its length along each dimension $j$ as $\Delta_{ij} = \max(x_{kj} \mid \mathbf{x}_k \in \mathcal{D}_i) - \min(x_{kj} \mid \mathbf{x}_k \in \mathcal{D}_i)$. The workspace $\Delta_i$ is adjusted to become $\Omega_i$ using a random perturbation conducted as follows. For each dimension $j$, a split point $v_j$ is chosen randomly within the range $\Delta_{ij}$. Then, the new range $\Omega_{ij}$ along dimension $j$ is defined as $[v_j - r, v_j + r]$, where $r = \max(v_j - \min_j(\Delta_i), \max_j(\Delta_i) - v_j)$. The new ranges on all dimensions define the adjusted workspace $\Omega_i$ for the tree-building process. This random initialisation of the workspace is done in line# 5 of Algorithm 1.
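The random perturbation of the workspace can be sketched as below. The function name `initialise_workspace` mirrors the pseudocode's InitialiseWorkSpace, but the body is an assumption based on the description above, not the authors' code.

```python
import random


def initialise_workspace(sample: list, rng: random.Random):
    """Return per-dimension (mins, maxs) of a randomly perturbed workspace.

    For each dimension a split point v is drawn within the data range, and
    the new range [v - r, v + r] uses the larger of the two distances from
    v to the range ends, so the original range stays covered.
    """
    mins, maxs = [], []
    for j in range(len(sample[0])):
        lo = min(p[j] for p in sample)
        hi = max(p[j] for p in sample)
        v = rng.uniform(lo, hi)        # random split point within the range
        r = max(v - lo, hi - v)        # half-width of the new range
        mins.append(v - r)
        maxs.append(v + r)
    return mins, maxs


rng = random.Random(1)
sample = [[0.0, 5.0], [1.0, 7.0], [4.0, 6.0]]
mins, maxs = initialise_workspace(sample, rng)
```

The new range is at most twice as long as the original one, and its mid point is the randomly drawn split point, which is why two trees built from the same instances still differ.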

A subset $\mathcal{D}_i$ is constructed by sampling $\psi$ instances without replacement from $D$. It is used to construct $T_i(\cdot)$. If there are not enough instances to sample, all the instances are used (i.e. $\psi = n$, if $\psi > n$). The random adjustment of the workspace as described earlier ensures that no two trees are identical even if they are constructed from the same set of instances. The random sampling is done in line# 4 of Algorithm 1.

The  dimension  to  split  is  selected  from  a  randomised  set  of  d  dimensions  in  a  round-robin 
manner  at  each  level  of  a  tree.  At  each  level,  the  workspace  is  divided  into  two  equal-volume 
half-spaces  by  splitting  at  mid-point  of  the  selected  dimension.  The  process  is  then  repeated 
recursively on each non-empty half-space until the maximum height ($h \times d$) is reached. Hence, each path from the root to a leaf has $h \times d$ nodes such that each of the $d$ attributes appears exactly $h$ times.

Each $T_i(\cdot|\mathcal{D}_i)$ is constructed within workspace $\Omega_i$, resulting in potentially $2^{hd}$ hyper-rectangular regions where every region has an equi-width $\delta x_{ij} = \Omega_{ij}/2^h$ on each dimension $j$ and a volume $v_i = \delta x_{i1} \times \cdots \times \delta x_{id}$. For example, in a one-dimensional space with workspace $\Omega_i$ derived from $\mathcal{D}_i$ and $h = 3$, $T_i(\cdot|\mathcal{D}_i)$ subdivides the workspace into $2^3$ regions. We use $T_i$ to denote $T_i^h$, unless $h$ is required in the context, and $T_i$ is built from $\mathcal{D}_i$, for each $i$.

Let $\ell = h \times d$, and $m_k$ be the mass of region $k$. There is a total of $2^{\ell}$ regions which have a total mass: $|\mathcal{D}| = \sum_{k=1}^{2^{\ell}} m_k$, where $m_k = m(T(\mathbf{x}|\mathcal{D}))$, and $\mathbf{x}$ is in region $k$ of $T$.
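The region geometry can be checked with a short calculation. The sketch assumes each dimension's workspace length is halved $h$ times; the function name and the example lengths are hypothetical.

```python
def region_geometry(range_lengths: list, h: int):
    """Return (widths, volume, number of regions) for a fully subdivided workspace."""
    widths = [length / 2 ** h for length in range_lengths]   # equi-width per dimension
    volume = 1.0
    for w in widths:
        volume *= w                                           # product of the widths
    n_regions = 2 ** (h * len(range_lengths))                 # 2^(h*d) potential regions
    return widths, volume, n_regions


# A two-dimensional workspace of lengths 8 and 4, with h = 2.
widths, volume, n_regions = region_geometry([8.0, 4.0], h=2)
```

Multiplying the region volume by the number of regions recovers the workspace volume, which confirms that the regions tile the workspace.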

Our implementation reduces the number of regions generated, usually to fewer than $2^{hd}$, for $\psi \ll n$. At each node, if one of the two child nodes is empty, the range of the node is reduced to the range of the non-empty child node instead of creating the empty node (line# 7-12 in Algorithm 2). This avoids creating unnecessary empty regions and reduces the memory requirement.

The height of each tree is $hd$. At each level of a tree, each instance in $\mathcal{D}$ has to be assigned to either of the two child nodes. Thus, the time complexity of constructing $t$ trees is $O(thd\psi)$.

There are a total of $\min(2^{hd}, \psi)$ leaf nodes in each tree. In general, $\psi < 2^{hd}$ with moderate $d$ and $h$. Thus, the space complexity is $O(td\psi + n)$ during construction. After the trees are built, the data set is discarded, yielding $O(td\psi)$.

To  estimate  the  density  of  a  given  instance  x,  only  these  trees  are  used  according  to  Eq.  (2). 

In the next section, we will show that the bias between $\bar{f}_m(\mathbf{x})$ and the true probability density function $p_d(\mathbf{x})$ converges asymptotically through a bias-variance analysis.
Algorithm 1: BuildTrees(D, t, ψ, h)

Inputs: D - input data with d attributes, t - number of trees, ψ - sub-sampling size, h - number of times an attribute is employed in a path.

Output: F - a set of t h:d-Trees

1: MaxHeightLimit ← h × d
2: Initialise F
3: for i = 1 to t do
4:    𝒟 ← sample(D, ψ) {strictly without replacement}
5:    (min, max) ← InitialiseWorkSpace(D)
6:    A ← Randomised list of d attributes.
7:    F ← F ∪ SingleTree(𝒟, min, max, 0)
8: end for


Algorithm 2: SingleTree(𝒟, min, max, ℓ)

Inputs: 𝒟 - input data, min & max - arrays of minimum and maximum values for each of the d attributes that define a work space, ℓ - current height level.

Output: an h:d-Tree

1: Initialise Node(·)
2: while (ℓ < MaxHeightLimit) do
3:    q ← nextAttribute(A, ℓ) {Retrieve an attribute from A based on height level.}
4:    mid_q ← (max_q + min_q)/2
5:    𝒟_l ← filter(𝒟, q < mid_q)
6:    𝒟_r ← filter(𝒟, q ≥ mid_q)
7:    if (|𝒟_l| = 0) or (|𝒟_r| = 0) then {Reduce range for single-branch node.}
8:       if (|𝒟_l| > 0) then max_q ← mid_q
9:       else min_q ← mid_q
10:      end if
11:      ℓ ← ℓ + 1
12:      continue at the start of while loop
13:   end if
14:   {Build two nodes: Left and Right as a result of a split into two equal-volume half-spaces.}
15:   temp ← max_q; max_q ← mid_q
16:   Left ← SingleTree(𝒟_l, min, max, ℓ + 1)
17:   max_q ← temp; min_q ← mid_q
18:   Right ← SingleTree(𝒟_r, min, max, ℓ + 1)
19:   terminate while loop
20: end while
21: return Node(Left, Right, SplitAtt ← q, SplitValue ← mid_q, Size ← |𝒟|)
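The two algorithms can be sketched in Python as below. This is a hedged illustration, not the authors' released code: the `Node` fields, the `density` helper, and the choice to derive the workspace from the sub-sample (as the text describes) are assumptions, and the sketch assumes every attribute takes at least two distinct values in each sub-sample so that no region has zero volume.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    size: int                           # mass: sub-sample instances in this region
    mins: List[float]                   # lower bounds of the region
    maxs: List[float]                   # upper bounds of the region
    split_att: Optional[int] = None
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def initialise_workspace(sample, rng):
    """Randomly perturbed range [v - r, v + r] per dimension, covering the sample."""
    mins, maxs = [], []
    for j in range(len(sample[0])):
        lo, hi = min(p[j] for p in sample), max(p[j] for p in sample)
        v = rng.uniform(lo, hi)
        r = max(v - lo, hi - v)
        mins.append(v - r)
        maxs.append(v + r)
    return mins, maxs


def single_tree(data, mins, maxs, level, max_height, attrs):
    mins, maxs = list(mins), list(maxs)      # private copies for this node
    while level < max_height:
        q = attrs[level % len(attrs)]        # round-robin attribute choice
        mid = (mins[q] + maxs[q]) / 2.0
        left = [p for p in data if p[q] < mid]
        right = [p for p in data if p[q] >= mid]
        if not left or not right:            # single branch: shrink the range instead
            if left:
                maxs[q] = mid
            else:
                mins[q] = mid
            level += 1
            continue
        left_maxs, right_mins = list(maxs), list(mins)
        left_maxs[q], right_mins[q] = mid, mid
        return Node(len(data), mins, maxs, q, mid,
                    single_tree(left, mins, left_maxs, level + 1, max_height, attrs),
                    single_tree(right, right_mins, maxs, level + 1, max_height, attrs))
    return Node(len(data), mins, maxs)       # leaf at the maximum height


def build_trees(data, t, psi, h, rng):
    forest = []
    for _ in range(t):
        sample = rng.sample(data, min(psi, len(data)))   # without replacement
        mins, maxs = initialise_workspace(sample, rng)
        attrs = list(range(len(data[0])))
        rng.shuffle(attrs)
        forest.append(single_tree(sample, mins, maxs, 0, h * len(attrs), attrs))
    return forest


def density(x, forest):
    """Average over trees of leaf mass / (psi * leaf volume); 0 outside a leaf's bounds."""
    total = 0.0
    for root in forest:
        node = root
        while node.left is not None:
            node = node.left if x[node.split_att] < node.split_value else node.right
        if all(lo <= xi <= hi for xi, lo, hi in zip(x, node.mins, node.maxs)):
            volume = 1.0
            for lo, hi in zip(node.mins, node.maxs):
                volume *= hi - lo
            total += node.size / (root.size * volume)
    return total / len(forest)


rng = random.Random(0)
data = [[rng.random()] for _ in range(500)]          # uniform on [0, 1)
forest = build_trees(data, t=50, psi=64, h=3, rng=rng)
inside, outside = density([0.5], forest), density([5.0], forest)
```

For uniform data on the unit interval the estimate near the centre should be close to 1, and a query far outside every workspace receives zero density. The bounds check at the leaf is what gives zero mass to queries that land in a half-space removed by range reduction.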


4  Error  analysis  through  bias-variance  decomposition 

The DEMass $\bar{f}_m(\mathbf{x})$ can be thought of as a random variable because of its dependence on $D$ and its random sub-samples $\mathcal{D}_i$ $(i = 1, \ldots, t)$. Accordingly, we analyse the mean squared error (MSE) of $\bar{f}_m(\mathbf{x})$ from the true probability density $p_d(\mathbf{x})$. It is defined as:

$$MSE(\bar{f}_m(\mathbf{x})) = E[\{\bar{f}_m(\mathbf{x}) - p_d(\mathbf{x})\}^2]$$

where the expectation $E[\cdot]$ is taken over the distribution of $\bar{f}_m(\mathbf{x})$. This is rewritten by introducing the expectation of $\bar{f}_m(\mathbf{x})$, $E[\bar{f}_m(\mathbf{x})]$, as follows [29]:

$$MSE(\bar{f}_m(\mathbf{x})) = \{E[\bar{f}_m(\mathbf{x})] - p_d(\mathbf{x})\}^2 + E[\{\bar{f}_m(\mathbf{x}) - E[\bar{f}_m(\mathbf{x})]\}^2]$$

The first term on the rhs is called ‘square bias’ and the second ‘variance’. We evaluate the magnitude of each of these two terms in the following.
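The decomposition can be verified numerically on a handful of hypothetical estimates, treating them as equally likely outcomes of the estimator. The function name and the numbers are illustrative only.

```python
def mse_decomposition(estimates: list, truth: float):
    """Return (MSE, square bias, variance) over equally weighted estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    square_bias = (mean - truth) ** 2
    variance = sum((e - mean) ** 2 for e in estimates) / n
    return mse, square_bias, variance


mse, sq_bias, var = mse_decomposition([0.8, 1.0, 1.2, 1.4], truth=1.0)
```

Here the mean estimate is 1.1, so the square bias is 0.01, the variance is 0.05, and their sum matches the MSE of 0.06.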
To simplify notation for the rest of the paper, we use $T_i(\mathbf{x})$ to denote $T_i(\mathbf{x}|\mathcal{D}_i)$ and $p(T_i(\mathbf{x}))$ to denote $p(\mathbf{x}_k \in T_i(\mathbf{x}) \mid \mathbf{x}_k \in \mathcal{D}_i)$.

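Let $\mathbf{c}_i$ be the centre of a region of $T_i(\mathbf{x})$ where each element $c_{ij}$ of $\mathbf{c}_i$ is a middle point of the interval on each dimension $j$. The second-order Taylor approximation of $p_d(\mathbf{x})$ around $\mathbf{c}_i$ for $T_i(\mathbf{x})$ is given as:

$$p_d(\mathbf{x})|_{\mathbf{c}_i \in T_i(\mathbf{x})} \approx p_d(\mathbf{c}_i) + (\mathbf{x} - \mathbf{c}_i)^T \nabla p_d(\mathbf{x})|_{\mathbf{x}=\mathbf{c}_i} + \frac{1}{2}\{(\mathbf{x} - \mathbf{c}_i)^T \nabla\}^2 p_d(\mathbf{x})|_{\mathbf{x}=\mathbf{c}_i}, \qquad (3)$$

where $\nabla = [\partial/\partial x_1, \ldots, \partial/\partial x_d]^T$.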

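Note that $m(T_i(\mathbf{x}))$ follows a binomial distribution2 $B(\psi, p(T_i(\mathbf{x})))$. Therefore, $E[\bar{f}_m(\mathbf{x})]$ is expressed by substituting $E[m(T_i(\mathbf{x}))] = \psi p(T_i(\mathbf{x}))$ in Eq. (2).

$$E[\bar{f}_m(\mathbf{x})] = \frac{1}{t}\sum_{i=1}^{t}\frac{E[m(T_i(\mathbf{x}))]}{\psi v_i} = \frac{1}{t}\sum_{i=1}^{t}\frac{p(T_i(\mathbf{x}))}{v_i} = \frac{1}{t}\sum_{i=1}^{t}\frac{1}{v_i}\int_{T_i(\mathbf{x})} p_d(\mathbf{x}^*)\,d\mathbf{x}^*. \qquad (4)$$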


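Accordingly, the square bias is evaluated as follows by applying Eq. (3) and the fact that the integral of an odd function over $[c_{ij} - \delta x_{ij}/2, c_{ij} + \delta x_{ij}/2]$ for each dimension $j$ is zero.

$$\{E[\bar{f}_m(\mathbf{x})] - p_d(\mathbf{x})\}^2 \approx \left[\frac{1}{t}\sum_{i=1}^{t}\left\{\frac{1}{24}\sum_{j=1}^{d}\left.\frac{\partial^2 p_d(\mathbf{x})}{\partial x_j^2}\right|_{\mathbf{x}=\mathbf{c}_i}\delta x_{ij}^2 - (\mathbf{x} - \mathbf{c}_i)^T \nabla p_d(\mathbf{x})|_{\mathbf{x}=\mathbf{c}_i} - \frac{1}{2}\{(\mathbf{x} - \mathbf{c}_i)^T \nabla\}^2 p_d(\mathbf{x})|_{\mathbf{x}=\mathbf{c}_i}\right\}\right]^2_{\mathbf{c}_i \in T_i(\mathbf{x})}$$

$$\le \left[\frac{1}{t}\sum_{i=1}^{t}\left\{\frac{1}{24}\sum_{j=1}^{d}\left|\frac{\partial^2 p_d(\mathbf{x})}{\partial x_j^2}\right|_{\mathbf{x}=\mathbf{c}_i}\delta x_{ij}^2 + \sum_{j=1}^{d}\left|\frac{\partial p_d(\mathbf{x})}{\partial x_j}\right|_{\mathbf{x}=\mathbf{c}_i}\frac{\delta x_{ij}}{2} + \frac{1}{2}\sum_{j=1}^{d}\sum_{k=1}^{d}\left|\frac{\partial^2 p_d(\mathbf{x})}{\partial x_j \partial x_k}\right|_{\mathbf{x}=\mathbf{c}_i}\frac{\delta x_{ij}}{2}\frac{\delta x_{ik}}{2}\right\}\right]^2_{\mathbf{c}_i \in T_i(\mathbf{x})} = O(4^{-h})$$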


This result shows that the square bias diminishes as level $h$ increases, i.e., as the size of the regions decreases. Though this analysis uses the second-order approximation of $p_d(\mathbf{x})$, the result using a higher-order approximation is the same since the first-order term dominates in the above formula.


2 The implementation of $T(\cdot)$ used in this paper is a tree-based nonparametric method. The binomial distribution is required for the error analysis only.
Because $m(T_i(\mathbf{x}))$ follows the binomial distribution $B(\psi, p(T_i(\mathbf{x})))$, the variance of $m(T_i(\mathbf{x}))$ is:

$$var[m(T_i(\mathbf{x}))] = \psi p(T_i(\mathbf{x}))(1 - p(T_i(\mathbf{x}))).$$

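In concert with Eq. (2), the variance of $\bar{f}_m(\mathbf{x})$ is represented as follows:

$$E[\{\bar{f}_m(\mathbf{x}) - E[\bar{f}_m(\mathbf{x})]\}^2] = \frac{1}{t^2}\sum_{i=1}^{t}\frac{p(T_i(\mathbf{x}))(1 - p(T_i(\mathbf{x})))}{\psi v_i^2} = \frac{1}{t^2}\sum_{i=1}^{t}\frac{1}{\psi v_i^2}\int_{T_i(\mathbf{x})} p_d(\mathbf{x}^*)\,d\mathbf{x}^*\left(1 - \int_{T_i(\mathbf{x})} p_d(\mathbf{x}^*)\,d\mathbf{x}^*\right)$$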

Using similar calculus to that applied to the square bias, we obtain the variance as follows, where $\mathbf{c}_i$ is the centre of $T_i(\mathbf{x})$.

$$E[\{\bar{f}_m(\mathbf{x}) - E[\bar{f}_m(\mathbf{x})]\}^2] = O(2^{dh})$$


This result indicates that the variance increases when level $h$ increases. Also, the result does not change even if we use a higher-order approximation because the term $p_d(\mathbf{c}_i)/v_i$ dominates in the above formula.
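A brief sketch of why the term $p_d(\mathbf{c}_i)/v_i$ dominates is given below. It is a leading-order argument only, under assumptions not spelled out in the text: the $t$ trees are treated as independent, $p_d$ is taken as approximately constant within a region, and the workspace lengths $\Omega_{ij}$ are treated as fixed.

```latex
p(T_i(\mathbf{x})) \approx p_d(\mathbf{c}_i)\, v_i
\;\Longrightarrow\;
\frac{p(T_i(\mathbf{x}))\,\bigl(1 - p(T_i(\mathbf{x}))\bigr)}{\psi\, v_i^2}
\approx \frac{1}{\psi}\left\{ \frac{p_d(\mathbf{c}_i)}{v_i} - p_d(\mathbf{c}_i)^2 \right\},
\qquad
v_i = \prod_{j=1}^{d} \delta x_{ij} = 2^{-dh} \prod_{j=1}^{d} \Omega_{ij}
\;\Longrightarrow\;
\frac{1}{v_i} = O(2^{dh}).
```

As $h$ grows, $1/v_i$ grows like $2^{dh}$ while $p_d(\mathbf{c}_i)^2$ stays bounded, so the first term determines the order of the variance.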

The property of DEMass, revealed from this error analysis, is similar to that of the conventional KDE, which shows a bias-variance trade-off: the bias decreases as the kernel bandwidth $b$ decreases, but this increases the variance, and the reverse is true if the kernel bandwidth is increased [29]. The parameter $k$ in the k-NN density estimator has the same effect.

In conclusion, DEMass has a comparable estimation of density with the KDE if both trade off bias and variance equally well, and it is indeed the case in practice. Figure 1 shows the estimation result of a normal distribution using KDE and DEMass. It demonstrates that DEMass produces a similar result to that generated by KDE, for different data sizes. Smoothing can be applied by increasing $b$ for KDE or decreasing $h$ for DEMass, which produces the estimation results shown in Fig. 2. The parameters used for DEMass are the following: $t = 1,000$ and $\psi = n$ when $n = 10, 100$; and $\psi = 1,000$ when $n = 1,000,000$.

Note that in either setting shown in Figs. 1 and 2, the estimations of both KDE and DEMass approach the true distribution as the number of instances increases.

Fig. 1 Example estimations of the kernel density estimator (with Gaussian kernel) using b = 0.1 and DEMass using h = 5 for different data sizes, n = 10, 100, 1,000,000. The true data distribution is a normal distribution

Fig. 2 Example estimations of the kernel density estimator (with Gaussian kernel) using b = 0.3 and DEMass using h = 3 for the same data used in Fig. 1


5  Using  DEMass  in  existing  density-based  algorithms 

This section describes how DEMass can be applied to three current density-based algorithms, DBSCAN [12], LOF [6] and Bayesian classifiers, in place of their existing density estimators. DBSCAN, LOF and Bayesian classifiers are among the best algorithms for clustering, anomaly detection and classification, respectively.

Using DEMass automatically carries the two advantages mentioned in Sect. 3: (i) the estimation requires no distance measures, so it avoids the cost of distance calculations for every pair of instances; and (ii) DEMass enables small samples to construct the required regions T(·|𝒟), overcoming the key limitation of DBSCAN and LOF in handling big data. We will discuss further advantages specific to individual algorithms in the following subsections.

Table 1 Algorithms for DBSCAN and DEMass-DBSCAN

Step              DBSCAN                                      DEMass-DBSCAN
1                 Label all points as core, border or noise   Label all T(x) satisfying Definition S1 in Table 2
                  points, based on f_ε(x)                     as core regions, based on f_m(x). Points not
                                                              covered by core regions are noise
2                 Eliminate noise points                      Eliminate noise points
3                 Connect all core points that are within     Connect all core regions that have non-zero
                  ε of each other                             intersections
4                 Make each group of connected core points    Make each group of connected core regions into
                  into a separate cluster                     a separate cluster
5                 Assign each border point to one of the      (not required)
                  clusters of its associated core points
Time complexity   O(dn^2)                                     O(thdn)
Space complexity  O(dn)                                       O(tdψ)

Note that border points are not required with DEMass-DBSCAN; thus step 5 is not needed. Both versions of DBSCAN could include an additional cluster size threshold to eliminate small clusters in the last step


5.1  DEMass-DBSCAN 

The  principal  steps  of  DEMass-DBSCAN  are  the  same  as  DBSCAN,  except  that  no  border 
points  and  their  associated  step  are  required.  A  comparison  of  the  two  algorithms  is  provided 
in  Table  1.  The  algorithm  for  DBSCAN  is  adapted  from  [30]. 

Following  its  principal  steps,  the  use  of  DEMass  simplifies  DBSCAN  in  two  ways,  in 
addition  to  the  two  advantages  already  mentioned  above.  First,  DEMass  enables  regions  to 
be  labelled  instead  of  individual  points.  Because  the  number  of  regions  is  significantly  less 
than  the  number  of  points,  labelling  and  linking  required  in  steps  1  and  3  become  significantly 
faster.  Second,  no  border  points  need  to  be  defined  because  the  connections  within  a  cluster 
are  established  via  core  regions  only  when  DEMass  is  used.  The  first  simplification  is  the 
key  reason  for  the  significant  speed  up  achieved  by  DEMass-DBSCAN,  which  we  will  show 
in  Sect.  6.1. 
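The region-based steps can be sketched in Python as follows. This is an illustrative simplification rather than the authors' implementation: each h:d-Tree is replaced by a randomly shifted grid of cells with side 2^-h over data assumed to lie in [0, 1)^d, mass is counted on the full data instead of a subsample of size ψ, and two core regions are treated as intersecting when they share a data point. The function name and all parameters are hypothetical.

```python
import random
from collections import Counter


def demass_dbscan_sketch(points, h=3, t=10, min_pts=5, seed=0):
    """Cluster points in [0, 1)^d; returns one label per point, -1 for noise."""
    rng = random.Random(seed)
    side = 2.0 ** -h
    d = len(points[0])
    # Assumption: a randomly shifted grid stands in for each of the t trees.
    shifts = [[rng.uniform(0, side) for _ in range(d)] for _ in range(t)]

    def cell(i, p):
        return (i,) + tuple(int((p[j] + shifts[i][j]) / side) for j in range(d))

    cells = [[cell(i, p) for i in range(t)] for p in points]
    mass = Counter(c for row in cells for c in row)

    # Step 1: core regions. All cells have equal volume here, so the
    # density condition reduces to mass >= MinPts.
    core = {c for c, m in mass.items() if m >= min_pts}

    parent = {c: c for c in core}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Step 3: link core regions that share a point (proxy for intersection).
    for row in cells:
        covering = [c for c in row if c in core]
        for c in covering[1:]:
            parent[find(c)] = find(covering[0])

    # Steps 2 and 4: uncovered points are noise; each component is a cluster.
    cluster_ids = {}
    labels = []
    for row in cells:
        covering = [c for c in row if c in core]
        if not covering:
            labels.append(-1)
        else:
            root = find(covering[0])
            labels.append(cluster_ids.setdefault(root, len(cluster_ids)))
    return labels
```

No pairwise distances are computed: each point is only mapped to t cells, and the linking works on regions, which is the source of the speed-up described above.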

The time complexity to construct t trees is O(thdψ). In order to assign a cluster to each instance, each of the t trees is traversed from the root to the respective leaf node. The time complexity of this search is O(thdn). Thus, the overall time cost of DEMass-DBSCAN is O(thdψ + thdn). Since ψ ≪ n, the first term can be ignored. Hence, the total time complexity of DEMass-DBSCAN is O(thdn). In contrast, the time complexity of DBSCAN is O(dn^2), as it requires pairwise distance calculations between n instances in order to search for ε-neighbours. Also, DBSCAN needs to store all the instances in order to cluster a future instance, yielding a space complexity of O(dn). DEMass-DBSCAN, however, requires space to store t h:d-Trees only, which is O(tdψ). The space complexity of DEMass-DBSCAN is therefore constant (independent of n).

The DEMass-DBSCAN algorithm shown in Table 1 is derived from set-based definitions, and the DBSCAN algorithm is derived from point-based definitions. We can use point-based definitions for DEMass-DBSCAN, except that the neighbourhood definition needs to be adapted to DEMass. Although point-based definitions can be defined as in DBSCAN [12], set-based definitions are simpler. The formal point-based and set-based definitions for DEMass-DBSCAN are given in Table 2.

Table 2 Point-based and set-based definitions for DEMass-DBSCAN. Note that T(·) is used to denote T(·|𝒟)

Point-based definitions

Definition P1: The h-neighbourhood of a point p, denoted by N_h(p), is defined by N_h(p) = {q ∈ D | q ∈ T_h(p)}. In contrast, the ε-neighbourhood for DBSCAN is defined as N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}, which requires a distance function dist(·, ·). No distance functions are required for N_h(·).

Definition P2: A point p is directly density-reachable from a point q wrt h and MinPts if
(i) p ∈ N_h(q) and
(ii) |N_h(q)|/v ≥ MinPts/v_max (core point condition).

Definition P3: A point p is density-reachable from a point q wrt h and MinPts if there is a chain of points p_1, ..., p_n, where p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.

Definition P4: A point p is density-connected to a point q wrt h and MinPts if there is a point o such that both p and q are density-reachable from o wrt h and MinPts.

Definition P5: Let D be a data set of points. A cluster C wrt h and MinPts is a non-empty subset of D satisfying the following conditions:
(i) ∀p, q: if p ∈ C and q is density-reachable from p wrt h and MinPts, then q ∈ C (Maximality)
(ii) ∀p, q ∈ C: p is density-connected to q wrt h and MinPts (Connectivity)

Definition P6: Let C_1, ..., C_k be the clusters of the data set D wrt h and MinPts. Then we define noise as the set of points in the data set D not belonging to any cluster C_j, i.e., noise = {p ∈ D | ∀j: p ∉ C_j}.

Set-based definitions

Definition S1: T(x) is a core region of point x wrt h and MinPts if m(T(x))/v ≥ MinPts/v_max, where v_max = max_i v_i and v is the volume of region T(x).

Definition S2: T_r(·) is density-connected to T_s(·) wrt h and MinPts if there is a chain of regions T_1(·), ..., T_g(·), where r = 1 and s = g, such that T_l(·) ∩ T_{l+1}(·) ≠ ∅ and T_l(·) is a core region for l ∈ {1, ..., g} wrt h and MinPts.

Definition S3: An arbitrary-shape cluster C wrt h and MinPts is a non-empty subset of a data set D satisfying the following condition: ∀r, s; T_r(·), T_s(·) ⊂ C: T_r(·) is density-connected to T_s(·) wrt h and MinPts.

Definition S4: Let C_1, ..., C_k be the clusters of D wrt h and MinPts. Noise is the set of points in D not belonging to any cluster C_j, i.e., noise = {x ∈ D | ∀j: x ∉ C_j}.

The point-based definitions are adopted from those defined for DBSCAN [12]

Fig. 3 An example for DBSCAN and DEMass-DBSCAN for MinPts = 5. a An example for DBSCAN for MinPts = 5. A is a core point, B is a border point, and C is a noise point. b An example for DEMass-DBSCAN for MinPts = 5. The circle symbol indicates core points, and the star symbol indicates noise points. T3, T4 and T5 are core regions. T3 and T4 are linked by a common core point

Table 3 Algorithms for LOF and DEMass-LOF

Step              LOF                                         DEMass-LOF
1                 Compute density distribution: f_kNN(x)      Compute density distribution: f_m(x)
2                 Compute LOF(x)                              Compute LOFp(x)
3                 Rank all instances based on their LOF       Rank all instances based on their LOFp
                  values in descending order                  values in descending order
Time complexity   O(dn^2)                                     O(thdn)
Space complexity  O(dn)                                       O(tdψ)

In step 2, LOF(x) and LOFp(x) are computed using:

\[ \mathrm{LOF}(x) = \frac{\sum_{x' \in N(x,k)} f_{kNN}(x')}{|N(x,k)|\; f_{kNN}(x)}, \qquad \mathrm{LOF}_p(x) = \frac{\frac{1}{t} \sum_{i=1}^{t} \frac{m(\tilde{T}_i(x))}{\psi \tilde{v}_i}}{f_m(x)} \]

f_kNN(x) and N(x, k) are defined in Sect. 2.2; f_m(x) is defined in Sect. 3. T̃_i(x) ⊃ T_i(x) correspond to the parent and child nodes in our tree implementation; ψ and ṽ_i are the data size and volume of T̃_i(x), respectively. Note that T̃_i(x), the next superset of T_i(x), is not necessarily the region of x at level h − 1 because there are d levels in the tree for each increment of h and the implementation allows single-branch extensions if there are no data in other branches. See Sect. 3.2 for details of the implementation

A comparison between DBSCAN and DEMass-DBSCAN is provided using two examples shown in Fig. 3a, b. They show how core points and non-core points are labelled in DBSCAN and DEMass-DBSCAN. One superficial difference in these examples is that DBSCAN uses hyper-spheres and DEMass-DBSCAN uses hyper-rectangles. This difference can be easily eliminated by using the L∞-norm (instead of the L2-norm) in DBSCAN.
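The remark about norms can be made concrete with a small Python sketch. The helper names are hypothetical; the only fact used is that an ε-neighbourhood under the L∞-norm is an axis-aligned hypercube of side 2ε, whereas under the L2-norm it is a hyper-sphere of radius ε.

```python
def linf_distance(p, q):
    """Chebyshev (L-infinity) distance between two points."""
    return max(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean (L2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def eps_neighbourhood(p, data, eps, dist):
    """N_eps(p) = {q in D | dist(p, q) <= eps} for a given distance function."""
    return [q for q in data if dist(p, q) <= eps]
```

A point near the corner of the hypercube, such as (0.9, 0.9) relative to the origin with ε = 1, belongs to the L∞ neighbourhood but not to the L2 neighbourhood.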

5.2  DEMass-LOF 

Table  3  compares  the  algorithms  for  LOF  and  DEMass-LOF  which  have  three  identical 
principal  steps:  compute  density  distribution  and  LOF ,  and  then  rank  all  instances  based  on 
their  LOF  values.  The  key  difference  is  the  density  estimator  used  in  step  1  which  changes 
the  computation  of  LOF  in  step  2. 

In addition to the two advantages due to the use of DEMass mentioned in Sect. 3, the advantage specific to LOF is that DEMass enables the computation of the relative density to be substantially simplified, changing from nearest-neighbour-based to set-based. Instead of finding the neighbours of x and then computing the density of each neighbour, the modified ranking measure LOFp is computed based on the region T(x) and its immediately larger region T̃(x) ⊃ T(x). In the tree implementation of T(·), this corresponds to computing the density of the node into which x falls, relative to the density of its parent node.

In steps 1 and 2, the time complexities of DEMass-LOF and LOF are O(thdn) and O(dn^2), respectively. Since DEMass-LOF does not need to perform a neighbourhood search as in LOF, it is much faster, especially on large data sets. Unlike LOF, which stores all n instances, DEMass-LOF needs space to store t trees only. Hence, the space complexity of DEMass-LOF is O(tdψ).

The parameter k in LOF has an inverse relationship with h in DEMass-LOF, i.e., high h corresponds to low k (which covers a smaller region than that using low h or high k). A larger k increases LOF’s processing time, just as a larger h increases DEMass-LOF’s processing time.

Note that both LOF and LOFp are relative density scores, which range from 0 to +∞, indicating the degree of anomaly; the higher the score, the higher the degree of anomaly.
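The set-based ranking measure can be sketched in Python under simplifying assumptions. The sketch below is not the authors' tree implementation: the child region of x is taken to be a fixed grid cell of side 2^-h and its parent the enclosing cell of side 2^-(h-1), data are assumed to lie in [0, 1)^d, and the function name and parameters are hypothetical. An instance whose child region is empty in every subsample is given an infinite score, which is also an assumption of this sketch.

```python
import math
import random


def lof_p_scores(data, h=3, t=50, psi=64, seed=0):
    """Ratio of parent-region density to child-region density for each instance."""
    rng = random.Random(seed)
    d = len(data[0])
    side = 2.0 ** -h
    v_child, v_parent = side ** d, (2 * side) ** d
    child_density = [0.0] * len(data)
    parent_density = [0.0] * len(data)
    for _ in range(t):
        sample = rng.sample(data, min(psi, len(data)))
        child_mass, parent_mass = {}, {}
        for s in sample:
            c = tuple(int(v / side) for v in s)
            p = tuple(k // 2 for k in c)  # enclosing cell one level up
            child_mass[c] = child_mass.get(c, 0) + 1
            parent_mass[p] = parent_mass.get(p, 0) + 1
        n_s = len(sample)
        for idx, x in enumerate(data):
            c = tuple(int(v / side) for v in x)
            p = tuple(k // 2 for k in c)
            child_density[idx] += child_mass.get(c, 0) / (n_s * v_child * t)
            parent_density[idx] += parent_mass.get(p, 0) / (n_s * v_parent * t)
    return [pd / cd if cd > 0 else math.inf
            for pd, cd in zip(parent_density, child_density)]
```

An isolated instance that sits in a sparse cell next to a dense one receives a score well above 1, whereas instances inside the dense cell receive scores below 1; ranking the scores in descending order then places the isolated instance first.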

5.3  DEMass-Bayes 

Bayesian classifiers require density estimation in order to estimate the class conditional probability of a test instance x given class y, i.e., p(x|y). Because it is difficult to compute p(x|y) directly even in problems with a moderate number of dimensions, a number of assumptions have been made to simplify the computation. We describe those used in Naive Bayes (NB) [21], Bayesian networks (BayesNet) [15] and Aggregating One-Dependence Estimators (AODE) [38].

Naive Bayes assumes class conditional independence and estimates the density distribution on each dimension separately [21]:

\[ p(x \mid y) = \prod_{i=1}^{d} p(x_i \mid y) \tag{5} \]

For continuous-valued attributes, p(x_i|y) can be computed either through discretisation or using a density estimator. Naive Bayes with discretisation (NB-Disc) [11] estimates p(x_i|y) through discretisation. Naive Bayes with Gaussian distribution (NB-GD) [21] estimates p(x_i|y) through normal probability estimation. Naive Bayes with kernel density estimation (NB-KDE) [22] estimates p(x_i|y) through a kernel density estimator.
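As an illustration of the product of one-dimensional likelihoods, the following Python sketch evaluates the NB-GD form in log space. It is a minimal sketch, not the WEKA implementation used in the experiments; the function names and the layout of the parameter dictionaries are assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """One-dimensional normal density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def nb_gd_log_likelihood(x, means, stds):
    """log p(x|y) as a sum of per-dimension log normal densities."""
    return sum(math.log(gaussian_pdf(xi, m, s)) for xi, m, s in zip(x, means, stds))


def nb_gd_predict(x, priors, params):
    """params[label] = (means, stds); returns argmax_y p(y) * prod_i p(x_i|y)."""
    return max(priors, key=lambda label: math.log(priors[label])
               + nb_gd_log_likelihood(x, *params[label]))
```

Each attribute contributes one independent factor, so any dependence between attributes within a class is ignored by construction.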

The  assumption  made  by  Naive  Bayes  is  often  violated  in  the  real  world  where  attributes 
are  related  in  some  way.  Other  Bayesian  classifiers  such  as  BayesNet  [15]  and  AODE  [38] 
employ  less  restrictive  assumptions. 

BayesNet learns probabilistic relationships among attributes in the form of a directed acyclic graph (DAG) from the training data. In the graph, each node is probabilistically independent of its non-descendants given the state of its parents. At each node, joint probabilities with respect to its parents are learned from the training data. In many implementations, the continuous-valued attributes are discretised. The joint probability p(x, y) is estimated as:

\[ p(x, y) = p(y \mid \pi_y) \prod_{j=1}^{d} p(x_j \mid \pi_j) \tag{6} \]

where π_j is parent(x_j) and π_y is parent(y).

AODE allows conditional dependence on the class and one ‘privileged’ attribute; the other attributes are conditionally independent given the class label y and the privileged attribute x_i. The conditional probabilities are computed as follows:

\[ p(x \mid x_i, y) = \prod_{j=1}^{d} p(x_j \mid x_i, y) \tag{7} \]

As AODE cannot handle continuous-valued attributes, they are discretised, and the conditional probabilities are estimated by the proportion of instances having values in an interval belonging to class y.

In contrast to the existing implementations of Bayesian classifiers, the implementation based on DEMass estimates p(x|y) directly, without any assumptions. In order to use DEMass in classification, we made the following adjustments to estimate p(x|y).

• Instead of constructing t trees in total, t trees per class are constructed to estimate the density distribution of each class separately. The instances in each data subset 𝒟_i ⊂ D are separated according to their class labels into c subsets, where c is the number of classes, yielding 𝒟_{i,y} (y = 1, 2, ..., c); ∪_y 𝒟_{i,y} = 𝒟_i, and |𝒟_i| = ψ. A tree is constructed from 𝒟_{i,y} to represent regions T_{i,y}(·). Hence, a total of ct trees are constructed.

• Instead of growing the tree to the maximum height, we stop growing when the number of instances in a node is less than or equal to one. This provides a smoother estimation.

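We compute the class conditional probability based on DEMass as:

\[ p(x \mid y) \approx \frac{1}{t} \sum_{i=1}^{t} \frac{m(T_{i,y}(x))}{|\mathcal{D}_{i,y}|\, v_{i,y}} \tag{8} \]

where 𝒟_{i,y} is the subset of samples belonging to class y in 𝒟_i and v_{i,y} is the volume of the region T_{i,y}(x).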

The  rest  of  the  steps  in  DEMass-Bayes  are  the  same  as  existing  Bayesian  classifiers.  The 
prior  probabilities  p(y)  are  calculated  from  the  training  data.  Finally,  Bayes  rule  is  used  to 
predict  the  class  which  has  the  maximum  posterior  p(y\x). 


\[ \hat{y} = \arg\max_{y}\; p(y) \times p(x \mid y) \tag{9} \]


In  case  of  a  tie,  a  random  prediction  is  made  between  the  classes  yielding  the  equal 
maximum  posterior  probabilities. 
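The classification procedure can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: the per-class trees are replaced by fixed grids with cells of side 2^-h over data assumed to lie in [0, 1)^d, each class is subsampled to at most phi instances per grid, and the names train_demass_bayes and predict are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_demass_bayes(X, y, h=2, t=20, phi=32, seed=0):
    """Build t cell-count tables per class from random per-class subsamples."""
    rng = random.Random(seed)
    side = 2.0 ** -h
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {"side": side, "d": len(X[0]), "prior": {}, "grids": {}}
    for label, rows in by_class.items():
        model["prior"][label] = len(rows) / len(X)  # p(y) from training data
        grids = []
        for _ in range(t):
            sample = rng.sample(rows, min(phi, len(rows)))
            counts = Counter(tuple(int(v / side) for v in s) for s in sample)
            grids.append((counts, len(sample)))
        model["grids"][label] = grids
    return model


def predict(model, x, rng=random):
    """Return argmax_y p(y) * p(x|y); ties are broken at random."""
    side = model["side"]
    volume = side ** model["d"]
    cell = tuple(int(v / side) for v in x)
    scores = {}
    for label, grids in model["grids"].items():
        # Average of mass / (sample size * volume) over the class's grids.
        likelihood = sum(c.get(cell, 0) / (n * volume) for c, n in grids) / len(grids)
        scores[label] = model["prior"][label] * likelihood
    best = max(scores.values())
    return rng.choice([label for label, s in scores.items() if s == best])
```

The likelihood is estimated for the whole attribute vector at once, so no factorisation over dimensions is involved, and the model size depends on t, phi and the number of classes rather than on n.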

The  decision  rules  of  existing  Bayesian  classifiers  and  DEMass-Bayes  are  provided  in 
Table  4.  The  existing  Bayesian  classifiers  estimate  the  conditional  probability  as  the  product  of 
one-dimensional  likelihoods  as  shown  in  Table  4.  In  contrast,  the  proposed  Bayesian  classifier, 
DEMass-Bayes,  does  not  make  any  explicit  assumptions  and  estimates  the  multidimensional 
likelihood  directly  from  the  training  data. 

The time and space complexities of existing Bayesian classifiers and DEMass-Bayes are provided in Table 5. DEMass-Bayes uses subsets of the given training set to construct h:d-Trees in order to estimate mass. Its training time is therefore constant with respect to n, as shown in Table 5, whereas the training time of the other Bayesian classifiers is linear in n. The proposed method also has constant space complexity. Hence, the proposed method scales better than the existing Bayesian classifiers on big data.

6  Empirical  evaluation 

The evaluations in clustering and anomaly detection tasks are conducted in the unsupervised learning setting, whereas the classification task is evaluated in the supervised setting. We compare DBSCAN with DEMass-DBSCAN in the first subsection and then compare LOF with DEMass-LOF in the second subsection. Finally, we compare DEMass-Bayes with five existing Bayesian classifiers, namely NB-GD, NB-KDE, NB-Disc, BayesNet and AODE, in the last subsection.

Table 4 Decision rules of existing Bayesian classifiers and DEMass-Bayes

Classifier    Decision rule                                            Remarks
NB-GD         argmax_y p(y) ∏_{i=1..d} p(x_i|y)                        p(x_i|y) is estimated with normal distribution
NB-KDE        argmax_y p(y) ∏_{i=1..d} p(x_i|y)                        p(x_i|y) is estimated with KDE
NB-Disc       argmax_y p(y) ∏_{i=1..d} p(x_i|y)                        p(x_i|y) is estimated through discretisation
BayesNet      argmax_y p(π_y, y) ∏_{i=1..d} p(x_i|π_i, y)              π_i = parent(x_i), π_y = parent(y). Joint probabilities are estimated through discretisation
AODE          argmax_y Σ_{i=1..d} p(x_i, y) ∏_{j=1..d} p(x_j|x_i, y)   p(x_j|x_i, y) in Eq. (7) is estimated through discretisation
DEMass-Bayes  argmax_y p(y) p(x|y)                                     p(x|y) is estimated using Eq. (8)

Note that with the exception of DEMass-Bayes, all existing Bayesian classifiers estimate one-dimensional likelihoods

Table 5 Time and space complexities of existing Bayesian classifiers and DEMass-Bayes

Classifier     Time complexity (training)   Time complexity (testing)   Space complexity
NB-GD (a)      O(nd)                        O(cd)                       O(cd)
NB-KDE (a)     O(nd)                        O(cmd)                      O(cmd)
NB-Disc (b)    O(nd)                        O(cd)                       O(cdu)
AODE (b)       O(nd^2)                      O(cd^2)                     O(c(du)^2)
DEMass-Bayes   O(cthdφ)                     O(cthd)                     O(ctdφ)

n: total number of training instances, m: average number of training instances in a class, d: number of dimensions, c: number of classes, u: average number of discrete values of an attribute, t: number of trees, and φ: average number of samples per class in 𝒟_i
(a) Langley and John [22], (b) Webb et al. [38]

All experiments for the clustering and anomaly detection tasks were conducted as single-thread jobs processed at 2.3 GHz in a Linux cluster (www.vpac.org) using a node with 32 GB memory, whereas all the experiments for the classification task were conducted as single-thread jobs using a node in a Linux cluster with 2.27 GHz and 120 GB memory.

All DEMass-based algorithms were written in Java on the WEKA platform [39], as were DBSCAN and the existing Bayesian classifiers. LOF was written in Java on the ELKI platform version 0.4 [1].

The data sets used are from the UCI Machine Learning Repository [14], unless stated otherwise. Only data sets with more than 10,000 instances are used in order to examine the algorithms’ capability to deal with large data sets.

Table 6 Data sets used for the clustering task for comparing DEMass-DBSCAN with DBSCAN

Data sets                        Size n    #d    #Clusters
RingCurve-Wave-TriGaussian3D     70,000    3     7
RingCurve-Wave-TriGaussian48D    70,000    48    7
OneBig                           68,000    20    9
Pendigits                        10,992    16    10

The clustering result was reported in terms of CPU runtime (in seconds), number of clusters identified, number of unassigned instances and F-measure, which was calculated based on assigned instances only. F-measure = 1 when all assigned instances are in the correct clusters, i.e., perfect clustering, and F-measure = 0 if all instances are assigned to wrong clusters. The anomaly detection result was reported in terms of CPU runtime and AUC (Area Under ROC Curve) based on the ranked result. The classification result was reported in terms of classification accuracy and CPU runtime (in seconds). We tuned the parameters of each algorithm in the unsupervised learning setting and reported the best result. In the supervised learning setting, the default parameter settings were used for all the classifiers unless specified otherwise, and we reported the average classification accuracy and average runtime over a 10-fold cross-validation.
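For reference, AUC can be computed directly from a ranked result with the rank-sum (Mann-Whitney) statistic. The Python sketch below is a generic illustration and not the evaluation code used in the experiments; the function name is hypothetical, and tied scores are assumed to receive average ranks.

```python
def auc_from_scores(scores, is_anomaly):
    """AUC: probability that a random anomaly outranks a random normal instance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores so they share an average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for a in is_anomaly if a)
    n_neg = len(scores) - n_pos
    rank_sum = sum(r for r, a in zip(ranks, is_anomaly) if a)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A score that ranks every anomaly above every normal instance gives 1, the reverse ordering gives 0, and a score that cannot separate the two groups gives 0.5.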

6.1 DEMass-DBSCAN versus DBSCAN

DEMass-DBSCAN had ψ = 256 and t = 1,000 as default, and both DEMass-DBSCAN and DBSCAN used MinPts = 6 in all experiments. As a result, only one parameter needed to be tuned for a particular data set: h for DEMass-DBSCAN and ε for DBSCAN.

Table 6 provides the properties of the four data sets used. RingCurve-Wave-TriGaussian3D, RingCurve-Wave-TriGaussian48D and OneBig are the three largest data sets used in [34], and we use an additional data set, Pendigits, which has more than 10,000 instances.

RingCurve-Wave-TriGaussian consists of three two-dimensional synthetic data sets, RingCurve, Wave and TriangularGaussian, as shown in Fig. 10 in the ‘Appendix’, embedded in either a 3-dimensional data set or a 48-dimensional data set (where 42 dimensions are irrelevant with a constant value). There are a total of seven clusters with 10,000 instances in each cluster. In OneBig [26], the biggest cluster has 50,011 instances, and each of the other eight clusters has approximately 1,000 instances. In addition, there are 10,000 noise instances randomly distributed in the feature space. Each of the ten clusters in Pendigits has approximately 1,000 instances.

The clustering results from DEMass-DBSCAN and DBSCAN are shown in Table 7. DEMass-DBSCAN ran faster than DBSCAN by a factor of more than 17 in both the 3-dimensional and 48-dimensional RingCurve-Wave-TriGaussian data sets. In terms of #clusters and #unassigned, DEMass-DBSCAN performed slightly worse than DBSCAN in the 3-dimensional data set, but better in the 48-dimensional data set. DEMass-DBSCAN decreased its number of unassigned instances from 535 to 61 when the number of dimensions was increased from 3 to 48, whereas DBSCAN had the same 332 unassigned instances in both cases. DEMass-DBSCAN performed either similarly to or better than DBSCAN in terms of F-measure in these two data sets.

DEMass-DBSCAN and DBSCAN had the same clustering result for OneBig in terms of F-measure and number of clusters, but DEMass-DBSCAN ran faster than DBSCAN by a factor of 7. Note that DEMass-DBSCAN correctly identified all but one of the 10,000 noise instances, whereas DBSCAN correctly identified all of the noise instances. In Pendigits, the result showed that although DEMass-DBSCAN had a lower F-measure than DBSCAN, it was better than DBSCAN in all other measures: it had only 20% of instances unassigned, whereas DBSCAN had 57% of instances unassigned; and DEMass-DBSCAN found 47 clusters, whereas DBSCAN detected 65.

Table 7 Clustering results of DEMass-DBSCAN (h = 7 for RingCurve-Wave-TriGaussian3D, h = 6 for RingCurve-Wave-TriGaussian48D, h = 3 for OneBig and h = 2 for Pendigits) and DBSCAN (ε = 0.01 for RingCurve-Wave-TriGaussian3D and RingCurve-Wave-TriGaussian48D, ε = 0.1 for OneBig and ε = 0.2 for Pendigits)

             RingCurve-Wave-TriGaussian3D      RingCurve-Wave-TriGaussian48D
             DEMass-DBSCAN    DBSCAN           DEMass-DBSCAN    DBSCAN
Runtime      135              2,391            1,261            21,906
#Cluster     9                8                7                8
#Unassigned  535              332              61               332
F-measure    0.9999           0.9999           1.0000           0.9999

             OneBig                            Pendigits
             DEMass-DBSCAN    DBSCAN           DEMass-DBSCAN    DBSCAN
Runtime      1,145            8,544            91               204
#Cluster     9                9                47               65
#Unassigned  10,021           10,005           2,166            6,251
F-measure    1.00             1.00             0.65             0.75

Fig. 4 Scale-up test: DEMass-DBSCAN vs DBSCAN in the 48-dimensional RingCurve-Wave-TriGaussian data set. Note that DBSCAN completed the task of the one million data set (at data size ratio = 150) in 36 days versus DEMass-DBSCAN 4.5 h. Even with the 10 million data set, DEMass-DBSCAN completed it in 38 h, but it is infeasible to run DBSCAN

In order to examine how well the algorithms scale up to large data sets, we used the 48-dimensional RingCurve-Wave-TriGaussian data set and increased the data size from 7,000 to 70,000, half-a-million, 1 million and 10 million. Figure 4 plots the runtime ratio versus the data size ratio (1, 10, 75, 150 and 1,500) using 7,000 as the base. The result shows that DEMass-DBSCAN had a sub-linear increase in runtime: the runtime ratio increased from 1 to 101 when the data size ratio increased from 1 to 150. In contrast, DBSCAN’s runtime ratio increased from 1 to 18,000 with the same increase in data size ratio. DEMass-DBSCAN was faster than DBSCAN by a factor of 193 when the one million data set was used. Even when the data size was increased by a factor of 1,500, the runtime of DEMass-DBSCAN increased by a factor of only 862.
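These ratios can be turned into an empirical scaling exponent. The short Python sketch below assumes a power-law model, runtime ∝ n^a, fitted through the base point and one other measurement; the function name is hypothetical and the model is an assumption made for illustration, not an analysis taken from the experiments.

```python
import math


def scaling_exponent(size_ratio, runtime_ratio):
    """Exponent a in runtime ~ n**a implied by a pair of ratios."""
    return math.log(runtime_ratio) / math.log(size_ratio)


demass_exponent = scaling_exponent(150, 101)      # DEMass-DBSCAN, 7,000 -> ~1 million
dbscan_exponent = scaling_exponent(150, 18000)    # DBSCAN, same range
demass_exponent_10m = scaling_exponent(1500, 862) # DEMass-DBSCAN, 7,000 -> ~10 million
```

With the reported ratios, the DEMass-DBSCAN exponents come out slightly below 1 and the DBSCAN exponent comes out close to 2, which agrees with the O(thdn) and O(dn^2) complexities given in Table 1.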
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Table  8  Data  sets  used  for  the  anomaly  detection  task  for  comparing  DEMass-LOF  with  LOF 

Data sets           Size n     #d    Anomaly class
Http                567,497    3     Attack (0.4%)
ForestCover (FC)    286,048    10    Class 4 (0.9%) versus class 2
Mulcross            262,144    4     2 Clusters (10%)
Smtp                95,156     3     Attack (0.03%)
Shuttle             49,097     8     Classes 2,3,5,6,7 (7%)

Table 9 Compare LOF and DEMass-LOF in terms of AUC (Area Under ROC Curve) and time (in seconds)

            AUC                                   Time (s)
            DEMass-LOF        LOF                 DEMass-LOF        LOF
            h = 1    h = 4    k = 10   k = 60     h = 1    h = 4    k = 10    k = 60
Http        0.99     0.93     0.44     0.35       19       42       18,913    19,818
FC          0.74     0.77     0.57     0.58       39       40       10,835    11,147
Mulcross    0.96     0.09     0.59     0.59       12       53       5,432     5,486
Smtp        0.29     0.89     0.32     0.85       2        5        540       552
Shuttle     0.94     0.71     0.55     0.62       5        12       368       380

AUC = 1 is the perfect detection performance and AUC = 0 is the worst. The default settings for DEMass-LOF were h = 1, ψ = 256 and t = 100, which were used for all data sets. The parameters k (for LOF) and h (for DEMass-LOF) were changed in order to explore a better result


6.2  DEMass-LOF  versus  LOF 

For  anomaly  detection  tasks,  we  compare  LOF  with  DEMass-LOF  in  this  section.  Table  8 
provides  the  properties  of  the  data  sets  used.  These  are  the  five  largest  data  sets  used  by  [35]. 
Note  that  Http  and  Smtp  are  subsets  of  the  network  intrusion  data  set  used  in  KDDCUP  99 
[40]  and  an  anomaly  data  generator,  ‘Mulcross’  [27],  is  used  to  generate  a  synthetic  data 
set.  All  the  data  sets  used  have  nearly  fifty  thousand  or  more  instances,  with  the  largest 
up to half-a-million instances. The default settings for DEMass-LOF were ψ = 256 and t = 100.

Table 9 compares LOF with DEMass-LOF in terms of detection performance (AUC) and time. DEMass-LOF using either h = 1 or h = 4 obtained better AUC results than LOF. It is interesting to note that DEMass-LOF achieved extreme results in the Smtp and Mulcross data sets between the two h settings, and it behaved differently in these two data sets: a low h setting is better in Mulcross, but a high h setting is better in Smtp. This is because the two data sets have two different types of anomalies: clustered and scattered anomalies [24,25]. Mulcross has clustered anomalies, i.e., outlying clusters with high density but a small number of instances. DEMass-LOF with a high h setting (i.e., h = 4) regarded these anomaly clusters as more ‘normal’ than normal instances, which was reflected in the result: AUC = 0.09. In contrast, the Smtp data set has scattered anomalies, which are isolated outlying instances around normal clusters. This scenario requires a high h setting in order for DEMass-LOF to compute the right densities for these anomalies.

LOF was not competitive, and the AUC results did not change much from the presented results even when other k values were used (we tried k = 30, 40, 50, 80, 100 and 120).




Fig. 5 Scale-up test: LOF versus DEMass-LOF in Mulcross. The base for data size ratio is 8,192 instances, and the base for runtime ratio is the runtime on 8,192 instances

However, it should be noted that LOF could achieve good detection accuracy with an appropriate k. For example, LOF obtained AUC = 0.99 when k = 4,000 was used in the Shuttle data set. But a similar search in the three largest data sets failed with an out-of-memory problem, even though the computer system was allocated 32 GB of memory. This result reveals two universal problems with k-NN approaches like LOF: (i) an extensive parameter search is required to obtain good detection accuracy, which adds a significant cost to an already long-running process, so the total time cost is often prohibitive; and (ii) a high memory requirement.

Table  9  also  compares  these  detectors  in  terms  of  processing  time.  DEMass-LOF  was  one 
to  three  orders  of  magnitude  faster  than  LOF  in  these  data  sets. 

Figure 5 shows the runtime of both algorithms when scaling from 8,192 instances up to a
million instances in the Mulcross data set. The data size was increased by factors of 16, 32,
64 and 128 from 8,192 instances. DEMass-LOF increased its runtime by factors of 11, 23, 25
and 86, respectively. In contrast, LOF increased its runtime by factors of 217, 845, 2,371 and
11,173, respectively. At data size ratio = 128, which corresponds to a million instances, LOF
completed the task in 28 hours whereas DEMass-LOF accomplished it in 45 seconds.
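
As a rough check on the ratios quoted above, the growth exponent implied by each measurement can be computed as log(runtime ratio) / log(data size ratio). The sketch below uses only the numbers reported in the text; the names `implied_exponent` and `all_exponents` are ours, and the calculation assumes that runtime grows as a power of the data size, which is a simplification rather than a claim made by the experiments.

```python
import math

# Ratios quoted in the text (Mulcross, base = 8,192 instances).
SIZE_RATIOS = [16, 32, 64, 128]
RUNTIME_RATIOS = {
    "DEMass-LOF": [11, 23, 25, 86],
    "LOF": [217, 845, 2371, 11173],
}


def implied_exponent(size_ratio, runtime_ratio):
    """Return p such that runtime_ratio == size_ratio ** p (power-law assumption)."""
    return math.log(runtime_ratio) / math.log(size_ratio)


def all_exponents():
    """Map each algorithm to its implied exponent at every data size ratio."""
    return {
        name: [implied_exponent(s, r) for s, r in zip(SIZE_RATIOS, ratios)]
        for name, ratios in RUNTIME_RATIOS.items()
    }


if __name__ == "__main__":
    for name, exponents in all_exponents().items():
        print(name, [round(p, 2) for p in exponents])
```

Under this assumption, the exponents for LOF stay close to 2, in line with quadratic growth, while those for DEMass-LOF stay below 1.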

6.3  DEMass-Bayes  versus  Bayesian  classifiers 

In  this  subsection,  we  compare  the  performance  of  DEMass-Bayes  with  five  existing  Bayesian 
classifiers:  NB-GD,  NB-KDE,  NB-Disc,  BayesNet  and  AODE. 

For  better  estimation  of  multidimensional  density,  we  need  sufficient  training  data.  Hence, 
we  chose  large  data  sets  with  size  n  >  10,000.  We  tested  on  10  data  sets  with  different  sizes, 
dimensions,  number  of  classes  and  class  distributions.  The  properties  of  the  data  sets  are 
provided  in  Table  10. 

Of the 10 data sets used, Wave, RingCurve and OneBig [26] are synthetic, and the rest
are real data sets from the UCI Machine Learning Repository [14]. RingCurve and Wave
are subsets of the RingCurve-Wave-TriGaussian data set shown in the ‘Appendix’. OneBig
and Pendigits are the same data sets as used in Sect. 6.1. In OneBig, noise in the data set
is treated as a separate class; hence, it has 10 classes. Magic04 has two classes with 12,332
and 6,688 instances, and Mammography has two classes with 10,923 and 260 instances.
Letters is a data set of 26 characters (classes) with approximately 750 data instances in
each class. Out of seven classes in Shuttle, approximately 80% of the data belongs to the
first class, whereas the smallest class has only 10 instances. MiniBooNE is a data set from an
experiment to distinguish electron neutrinos (signal) from muon neutrinos (background). The
class distribution is approximately 7:3. CoverType is a data set of forest cover type with seven
classes. It is the biggest data set used, having more than half a million instances. The class
distribution is unbalanced, as two classes have more than two hundred thousand instances
each and the smallest class has 2,747 instances.

Table 10 Data sets used in the classification task to compare the performance of DEMass-Bayes
with five existing Bayesian classifiers

Data sets       Size n     #d    #c
CoverType       581,012    10     7
MiniBooNE       129,596    50     2
OneBig           68,000    20    10
Shuttle          58,000     8     7
Letters          20,000    16    26
RingCurve        20,000     2     2
Wave             20,000     2     2
Magic04          19,020    10     2
Mammography      11,183     6     2
Pendigits        10,992    16    10

In classification, it is important to estimate the density distribution of classes in the data
space as well as possible in order to separate them. To achieve better classification accuracy,
more samples are required to grow trees further to capture the detailed information about
the local class distributions. The larger the sample size used to build h:d-Trees, the better the
density estimation. Hence, two variants of DEMass-Bayes are used: one employs a fixed-size
sub-sample, and the other uses the entire training data.

1. DEMass-Bayes: A sub-sample 𝒟 ⊂ D (|𝒟| = ψ < n) was used to build each h:d-Tree.
The sub-sampling size (ψ) was set to 10,000 as default.

2. DEMass-Bayes': The entire training set (ψ = n) was used to build each h:d-Tree.

The other two parameters h and t were set as default to 10 and 100, respectively.

All the other algorithms were executed with the default parameter settings except
BayesNet. For BayesNet, the parameter ‘maximum number of parents’ was set to 100 to
examine whether a large number of parents produces better accuracy, and the parameter
‘initialise as naive Bayes’ was set to ‘false’ to initialise an empty network structure. The other
parameters were set to default values.

We normalised the data to the range [0, 1] to prevent attributes with large values from
dominating the volume calculation.
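
The text does not spell out the normalisation procedure. A minimal sketch, assuming plain per-attribute min-max scaling (the function name `min_max_normalise` and the handling of constant attributes are our assumptions), could look as follows:

```python
def min_max_normalise(rows):
    """Rescale every attribute (column) of `rows` to the range [0, 1].

    `rows` is a list of equal-length lists of numbers. A constant attribute
    carries no information, so it is mapped to 0.0 (an assumption).
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalised = []
    for row in rows:
        normalised.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalised
```

After this step every attribute spans the same unit interval, so no single attribute dominates the volume of a region simply because of its measurement scale.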

Since AODE cannot handle continuous-valued attributes, we discretised the attributes
using the method proposed in [13]. BayesNet does discretisation before building the
classification model.

We performed 10-fold cross-validation and reported the average accuracy and average
runtime. The classification accuracies (%) and runtime (seconds) are provided in Tables 11
and 13, respectively.

A statistical test based on two standard errors was performed to examine whether the
difference in classification accuracies between two classifiers is significant. The win:loss:draw
counts of both variants of DEMass-Bayes against the existing Bayesian classifiers are reported
in Table 12. A win or loss was counted if the difference was significant; otherwise, it was a
draw.

Table 11 Average classification accuracies (%) over a 10-fold cross-validation for DEMass-Bayes',
DEMass-Bayes and existing Bayesian classifiers: BayesNet, AODE, NB-KDE, NB-GD and NB-Disc

Data sets      DEMass-Bayes'   DEMass-Bayes    BayesNet   AODE    NB-KDE   NB-GD   NB-Disc
CoverType      92.05*          81.42†          87.54      72.89   66.72    63.05   66.56
MiniBooNE      88.13†          [illegible]†    90.19      89.58   86.06    83.40   86.18
OneBig         99.93†          99.93†          99.99      99.69   99.98    99.89   99.98
Shuttle        99.86†          [illegible]†    99.92      99.85   92.67    85.66   94.40
Letters        92.27*          91.71*          86.76      88.81   74.20    64.01   74.01
RingCurve      100.00*         100.00*         99.96      99.98   99.26    90.11   99.38
Wave           100.00*         100.00*         78.27      78.50   77.91    66.80   78.27
Magic04        84.17*          83.23           83.36      83.00   76.12    72.69   77.78
Mammography    98.64           98.64           98.48      98.42   97.86    95.68   97.50
Pendigits      98.87*          98.87*          96.56      97.84   88.64    85.75   87.78
Average        95.39           93.92           92.10      90.86   85.94    80.70   86.18

* (†) represents significantly better (worse) predictive accuracy of DEMass-Bayes' or DEMass-Bayes
compared with the best accuracy of the existing Bayesian classifiers, based on the
two-standard-error significance test
Boldface represents the best accuracy achieved in each data set


Table 12 Win:Loss:Draw counts of DEMass-Bayes' and DEMass-Bayes against the other
contenders in terms of accuracy based on the two-standard-error significance test

Contenders   DEMass-Bayes'   DEMass-Bayes
BayesNet     6:3:1           4:4:2
AODE         7:1:2           6:1:3
NB-KDE       9:1:0           8:2:0
NB-GD        10:0:0          10:0:0
NB-Disc      9:1:0           8:2:0

The results in Tables 11 and 12 show that DEMass-Bayes' yielded better classification
accuracies in most of the data sets. Even the sub-sample version DEMass-Bayes produced
classification accuracies competitive with those of the existing Bayesian classifiers.
DEMass-Bayes' had six wins, three losses and one draw against BayesNet; seven wins, one loss
and two draws against AODE; nine wins and one loss against NB-KDE and NB-Disc; and all ten
wins over NB-GD. Similarly, DEMass-Bayes had four wins and four losses against BayesNet; six
wins and one loss against AODE; eight wins and two losses against NB-KDE and NB-Disc; and
all ten wins against NB-GD.
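
The exact form of the two-standard-error test is not given in the text. One common reading, sketched below, declares a difference significant when the gap between mean fold accuracies exceeds twice the standard error of the difference. The function names and the way the two standard errors are combined are our assumptions, not necessarily the procedure used to produce Table 12.

```python
import math
from statistics import mean, stdev


def standard_error(scores):
    """Standard error of the mean of per-fold accuracies."""
    return stdev(scores) / math.sqrt(len(scores))


def compare(acc_a, acc_b):
    """Return 'win', 'loss' or 'draw' for classifier A against classifier B."""
    diff = mean(acc_a) - mean(acc_b)
    # Assumption: the two standard errors are combined in quadrature.
    se_diff = math.sqrt(standard_error(acc_a) ** 2 + standard_error(acc_b) ** 2)
    if abs(diff) > 2 * se_diff:
        return "win" if diff > 0 else "loss"
    return "draw"


def tally(folds_a, folds_b):
    """Win:loss:draw string over data sets.

    `folds_a` and `folds_b` map a data set name to its per-fold accuracies.
    """
    counts = {"win": 0, "loss": 0, "draw": 0}
    for name in folds_a:
        counts[compare(folds_a[name], folds_b[name])] += 1
    return "{win}:{loss}:{draw}".format(**counts)
```

Applied to the per-fold accuracies of two classifiers over all ten data sets, `tally` yields a count in the same win:loss:draw form as the entries of Table 12.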

Both variants of DEMass-Bayes outperformed existing Bayesian classifiers in four data
sets, namely Pendigits, Wave, RingCurve and Letters. In the case of Wave and Letters, they had
a large improvement in accuracy over existing Bayesian classifiers of 20 and 5%, respectively.
They had slightly poorer accuracy than the best existing Bayesian classifier (BayesNet) in
three data sets: MiniBooNE, OneBig and Shuttle.

Wave is a synthetic data set of two parallel waves as shown in Fig. 10b in the ‘Appendix’.
In this data set, the prediction decision of NB-GD and NB-KDE depends on the likelihood
on the y-dimension only, as the classes are equally likely everywhere on the x-dimension.
The same applies to NB-Disc, AODE and BayesNet, which used a supervised discretisation
method; as a result, the values in the x-dimension cannot be discretised, and they are grouped
as a single block. The accuracies of AODE and BayesNet increased to 97.35 and
97.34%, respectively, with unsupervised 10-bin equal-frequency discretisation [7], but they
were still significantly worse than that of DEMass-Bayes. DEMass-Bayes, which estimates
the multidimensional likelihood directly by considering both dimensions at once, models
the distribution well. Hence, it produced significantly better accuracy than the other Bayesian
classifiers in the Wave data set.

Table 13 Average runtime (in seconds) over a 10-fold cross-validation for DEMass-Bayes',
DEMass-Bayes and existing Bayesian classifiers: BayesNet, AODE, NB-KDE, NB-GD and NB-Disc

Data sets      DEMass-Bayes'   DEMass-Bayes   BayesNet   AODE   NB-KDE   NB-GD   NB-Disc
CoverType      1021.5          163.0          403.6      10.8   96.3     15.3    55.2
MiniBooNE      474.6           91.8           324.2      11.8   831.6    17.0    45.1
OneBig         122.3           22.3           432.5      3.9    253.0    3.2     11.9
Shuttle        35.0            17.5           8.4        0.9    1.5      0.9     2.7
Letters        48.2            22.2           5.5        0.8    2.5      0.9     1.2
RingCurve      6.4             5.7            0.3        0.2    2.4      0.2     0.2
Wave           6.8             5.8            0.3        0.2    2.5      0.1     0.3
Magic04        13.5            10.7           1.8        0.4    8.8      0.3     0.9
Mammography    4.7a            6.7a           0.3        0.3    0.5      0.1     0.2
Pendigits      12.5b           12.5b          2.3        0.4    1.6      0.3     0.6

a In each fold of a 10-fold cross-validation of Mammography, there are 10,065 instances in the
training set. DEMass-Bayes samples 10,000 from 10,065 instances to build each tree, whereas
DEMass-Bayes' uses the entire 10,065 instances. The tree-building time of DEMass-Bayes (with
10,000 instances) and DEMass-Bayes' (with 10,065 instances) is almost the same. DEMass-Bayes is
slower than DEMass-Bayes' because of the sampling time

b In the case of Pendigits, both DEMass-Bayes and DEMass-Bayes' use the entire training set of
9,893 instances in each fold to construct each tree, as ψ (= 10,000) > n (= 9,893)
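
The effect described for the Wave data set can be reproduced in miniature. The sketch below is not DEMass-Bayes: it compares a hand-written Gaussian naive Bayes, which models each dimension separately, with a plain fixed-grid histogram that counts class members per two-dimensional cell. The wave shape, the vertical offset of 1, the noise level, the grid resolution and all function names are our assumptions, chosen only to illustrate why a joint estimate of p(x|y) separates two parallel waves when per-dimension estimates cannot.

```python
import math
import random

BINS, Y_LOW, Y_HIGH = 20, -1.5, 2.5  # assumed grid resolution and y-range


def make_waves(n, seed):
    """Two parallel sine waves; class 1 is class 0 shifted up by 1."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        label = rng.randrange(2)
        x = rng.uniform(0.0, 2.0 * math.pi)
        y = math.sin(x) + label + rng.gauss(0.0, 0.05)
        points.append((x, y, label))
    return points


def fit_gaussian_nb(train):
    """Per class: prior plus (mean, variance) of each dimension."""
    model = {}
    for c in (0, 1):
        rows = [(x, y) for x, y, label in train if label == c]
        params = []
        for j in range(2):
            values = [r[j] for r in rows]
            mu = sum(values) / len(values)
            var = sum((v - mu) ** 2 for v in values) / len(values)
            params.append((mu, var))
        model[c] = (len(rows) / len(train), params)
    return model


def predict_gaussian_nb(model, x, y):
    def log_posterior(c):
        prior, params = model[c]
        total = math.log(prior)
        for v, (mu, var) in zip((x, y), params):
            total += -0.5 * math.log(2.0 * math.pi * var) - (v - mu) ** 2 / (2.0 * var)
        return total
    return max((0, 1), key=log_posterior)


def cell_of(x, y):
    """Index of the equal-sized grid cell containing (x, y)."""
    i = min(int(x / (2.0 * math.pi) * BINS), BINS - 1)
    j = min(max(int((y - Y_LOW) / (Y_HIGH - Y_LOW) * BINS), 0), BINS - 1)
    return i, j


def fit_grid(train):
    """Count training instances of each class in every occupied cell."""
    counts = {}
    for x, y, label in train:
        counts.setdefault(cell_of(x, y), [0, 0])[label] += 1
    return counts


def predict_grid(counts, x, y):
    # With equal cell volumes, the class count in a cell is proportional to
    # P(y) * p(x|y), so the larger count gives the Bayes decision.
    c0, c1 = counts.get(cell_of(x, y), (0, 0))
    return 1 if c1 > c0 else 0


def accuracy(predict, test):
    return sum(1 for x, y, label in test if predict(x, y) == label) / len(test)


def run_demo():
    train, test = make_waves(4000, seed=1), make_waves(2000, seed=2)
    nb, grid = fit_gaussian_nb(train), fit_grid(train)
    nb_acc = accuracy(lambda x, y: predict_gaussian_nb(nb, x, y), test)
    grid_acc = accuracy(lambda x, y: predict_grid(grid, x, y), test)
    return nb_acc, grid_acc
```

Calling `run_demo()` returns the two test accuracies. Because the classes share the same marginal on x and overlap on y, the per-dimension model is left with the overlapping y-marginals only, whereas the joint grid sees the two waves as disjoint bands.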

The  biggest  difference  between  DEMass-Bayes  and  DEMass-Bayes'  was  observed  in 
the  CoverType  data  set.  Even  in  this  data  set,  DEMass-Bayes  produced  significantly  better 
classification  accuracy  than  the  existing  Bayesian  classifiers  except  BayesNet.  It  produced 
lower  accuracy  than  BayesNet  because  the  sample  size  was  not  enough.  More  samples  are 
required  to  grow  the  trees  further  to  model  the  distribution  well  if  the  class  distribution  in  the 
feature space is complex. A detailed discussion is provided in Sect. 6.3.2.

Table 13 shows that DEMass-Bayes was generally slower than the other Bayesian
classifiers in terms of runtime in smaller data sets. However, it should be noted that the existing
Bayesian classifiers assume some kind of conditional independence and estimate simplified
surrogates of p(x|y) one dimension at a time. In contrast, the proposed method estimates
p(x|y) in multidimensional space directly from the given training data without making any
explicit assumptions. Nevertheless, the sub-sample version DEMass-Bayes was up to an order of
magnitude faster than BayesNet and NB-KDE in some large data sets such as MiniBooNE and
OneBig; DEMass-Bayes and BayesNet were in the same order of magnitude in CoverType.
Similarly, DEMass-Bayes', BayesNet and NB-KDE were in the same order of magnitude in
the MiniBooNE and OneBig data sets. Note that the presented runtime results for AODE did
not include the discretisation time, which was incurred in a preprocessing step. The
discretisation time was substantial in large data sets; for example, it took 52 seconds in the
largest data set, CoverType.

Fig. 6 Scale-up test for training time: DEMass-Bayes versus existing Bayesian classifiers in
the 48-dimensional RingCurve-Wave-TriGaussian data set. The base for data size ratio is 70,000
instances, and the base for runtime ratio is the runtime on 70,000 instances. The training size
ratio is on a logarithmic scale of base 10

This runtime result does not provide a full picture of the time complexities of
DEMass-Bayes' and DEMass-Bayes. Therefore, we conducted a scale-up test of the algorithms,
presented in the following subsection, to give a more accurate idea of the time complexity.


6.3.1  Scale-up  test 

In  order  to  examine  how  well  the  classifiers  scale  up  to  large  data  sets,  we  used  the 
48-dimensional  RingCurve-Wave-TriGaussian  data  set,  used  in  Sect.  6.1.  Data  size  was 
increased  from  70,000  to  half-a-million,  1  million  and  10  million. 

Figure 6 shows the increase in training time of both variants of DEMass-Bayes and the
existing Bayesian classifiers. Note that AODE had an unfair advantage over the other
contenders in terms of runtime because it is the only algorithm whose reported runtime does
not include the additional discretisation time in the preprocessing step, which increased
linearly with the training data size.

With the increase in data size by a factor of 7.5, 15 and 150, DEMass-Bayes increased
its runtime to learn a classification model by a factor of 1.6, 1.7 and 2.1, respectively. The
closest contender, AODE, increased its runtime by a factor of 6, 11 and 128, followed by
BayesNet (12, 27, 335), NB-GD (12, 24, 386), NB-Disc (15, 31, 491) and NB-KDE (16, 24,
545). DEMass-Bayes' increased its runtime by a factor of 9, 23 and 201, respectively.

The  training  time  of  DEMass-Bayes  was  constant;  some  fluctuations  in  the  runtime  were 
due  to  the  memory  management  issue  of  Java,  especially  when  the  process  demands  high 
memory.  The  training  time  complexity  of  DEMass-Bayes'  was  slightly  worse  than  that  of 
AODE  but  better  than  that  of  the  other  contenders  (BayesNet  and  NB). 

In  a  nutshell,  DEMass-Bayes  has  a  better  scale-up  capability  than  the  existing  Bayesian 
classifiers  in  big  data.  The  result  is  consistent  with  the  time  complexities  presented  in  Table  5. 


6.3.2  Sensitivity  of  parameters 

In order to examine the effect of the parameters, i.e., sample size ψ, height h and number of
trees t, on the classification accuracy of DEMass-Bayes, we ran three experiments on five
real large data sets, namely CoverType, MiniBooNE, Shuttle, Letters and Magic04:

•  Vary sample size ψ with a fixed number of trees (t = 100) and height (h = 10).

•  Vary height h with a fixed sample size (ψ = 10,000) and number of trees (t = 100).

•  Vary number of trees t with a fixed sample size (ψ = 10,000) and height (h = 10).

Fig. 7 Effect of parameters sample size (ψ), height (h) and number of trees (t) on the accuracy
of DEMass-Bayes in Shuttle, Letters, MiniBooNE, Magic04 and CoverType. The horizontal axes of
sample size (ψ) and the number of trees (t) are on logarithmic scales of base 2 and base 10,
respectively, in (a) and (c)


Figure  7  shows  the  effect  of  the  three  parameters  on  the  classification  accuracy  of  DEMass- 
Bayes. 

With the increase in sample size ψ, as shown in Fig. 7a, the accuracy increased up to a
certain point (ψ = 10,000) and then remained almost flat in all data sets except CoverType.
In CoverType, accuracy kept increasing because the two biggest classes have more than two
hundred thousand instances each, and the continuous accuracy improvement is a result
of improved accuracy for these two classes. These two big classes overlap heavily.
The two distributions can be better distinguished if the regions around the overlapped areas
are small. If there are more samples to grow trees further, a better estimate of p(x|y) can be
achieved and the classification accuracy can be improved. The accuracy increased to
84.87% with ψ = 20,000 and 88.53% with ψ = 50,000. Figure 8 shows the improvement in
accuracy in CoverType when the sample size was increased. DEMass-Bayes using 250,000
instances produced the same result as DEMass-Bayes' using the entire training set of 581,012
instances.

When h was increased, as shown in Fig. 7b, the accuracy increased initially and
then remained almost constant after h = 6. In Shuttle, there were small improvements until
h = 10. If there are not enough samples to grow a tree further, tree building stops early.
Hence, increasing h does not affect the performance if it is already set to a sufficiently high
value.

Fig. 8 Increase in accuracy of DEMass-Bayes with increase in sample size in the CoverType
data set. Accuracies are measured

When the number of trees t was increased, as shown in Fig. 7c, the accuracy increased
initially and remained constant after reaching a certain point (t = 100), except in CoverType,
where accuracy kept increasing. Increasing the number of trees provides a better approximation
of the local density in the overlapped areas and improves the classification accuracy of the
two largest classes.

Figure 9 shows the effect of parameters on the average runtime over a 10-fold
cross-validation in the biggest data set, CoverType. The runtime in the plots is presented as a
runtime ratio to show the increase in runtime when parameters were increased. The bases
for the runtime ratio while varying ψ, h and t are the total runtime (including training and
testing) for ψ = 500, h = 2 and t = 10, respectively. The runtime increased linearly with t
and sub-linearly with ψ. Since the number of dimensions was 10 and ψ = 10,000, the tree
building stopped early when all the instances had been separated. Hence, as shown in Fig. 9b,
after reaching a certain point (h = 6), further increase in h did not affect the runtime.

7  Discussion 

What we have presented is the first density estimation method that utilises no distance
measures. It potentially solves fundamental problems such as the curse of dimensionality, in which
the use of a distance measure plays a key part in creating the problem [4,18]. Although the
current version of DEMass cannot deal with high-dimensional problems because of the
grid-based implementation, a non-grid-based implementation that utilises no distance measure
can potentially break the curse of dimensionality.

There have been significant improvements in nearest neighbour search in recent times. For
example, indexing schemes to speed up nearest neighbour search such as Cover Trees [5] and
M-Trees [8] are claimed to have time complexity significantly better than O(n²). Indexing
schemes such as Cover Trees and M-Trees rely on distance-based pruning methods in
both the index tree construction and range query processes. Distance-based pruning methods
cannot scale up to massive data, and they are known to be inefficient even for a moderate
number of dimensions. Thus, it is unlikely that any of the recent indexing schemes can be
used to speed up nearest neighbour search to the level that has been achieved already by
DEMass-DBSCAN and DEMass-LOF, especially in big data.

An exception to the above indexing schemes is PINN (projection-indexed nearest
neighbours) [37], which employs Random Projection [9,20] to project a high-dimensional space
to a low-dimensional space in order to get an accurate approximation of k-NN distances
for density calculation within LOF. This indexing scheme has reduced LOF’s time
complexity from O(n²) to sub-quadratic. It would be interesting to investigate how PINN may help
to enable DEMass-LOF to deal with high-dimensional problems, given that DEMass-LOF does not
involve distance calculation and has already achieved sub-linear time complexity without
any indexing scheme.

Fig. 9 Effect of parameters sample size (ψ), height (h) and number of trees (t) on the runtime
of DEMass-Bayes in the CoverType data set. The horizontal axes of sample size (ψ) and the
number of trees (t) are on logarithmic scales of base 2 and base 10, respectively, in (a) and
(c). The vertical axis for runtime ratio when varying t is on a logarithmic scale of base 10 in (c)

Note that the purpose of trees used in DEMass differs from that used for Cover Trees
or M-Trees. Trees in DEMass are used to estimate mass and density, the core computation
process. In contrast, Cover Trees or M-Trees are indices used to speed up nearest neighbour
search. The indices are required because the core computation, i.e., the requirement to
calculate distance for every pair of instances, is slow. In other words, one uses trees directly in
the core process, and the other uses trees to aid the core process where trees are not used in
the actual computation of distance.

The cost of KDE estimation can be lowered, for example, by reducing the given data
set D to some ‘representative’ subset, where each representative kernel is derived from a
sub-sample using a maximum likelihood method such as an expectation-maximisation (EM)
algorithm [10,16]. This reduces the KDE estimation time, but it comes at the cost of an
expensive preprocessing step.

DENCLUE [19], a generic density-based algorithm, builds a density distribution from data
and then uses a threshold to determine clusters: all connected points above the threshold
form a cluster. DBSCAN is a special case of DENCLUE. DEMass-DENCLUE has exactly the
same procedure as DEMass-DBSCAN, where MinPts or the equivalent density threshold
stated in Sect. 5.1 is employed as the threshold.
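
The thresholding idea, in which connected regions whose density exceeds a threshold form a cluster, can be sketched on a two-dimensional grid of density values. This is an illustrative simplification under our own assumptions (a dictionary of grid cells, 4-neighbour connectivity, and the name `threshold_clusters`); it is neither DENCLUE itself nor the DEMass-DBSCAN procedure of Sect. 5.1.

```python
from collections import deque


def threshold_clusters(density, threshold):
    """Group grid cells whose density exceeds `threshold` into clusters.

    `density` maps (i, j) cell indices to an estimated density value.
    Two dense cells belong to the same cluster if they are linked by a
    chain of dense cells sharing an edge (4-neighbour connectivity).
    """
    dense = {cell for cell, value in density.items() if value > threshold}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:  # breadth-first search over connected dense cells
            i, j = queue.popleft()
            members.append((i, j))
            for neighbour in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(members))
    return clusters
```

Cells at or below the threshold are simply left out of every cluster, which corresponds to treating them as noise.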

It  is  possible  to  use  neighbours  to  compute  LOF  for  DEMass-LOF.  However,  the  runtime 
advantage  over  LOF  will  be  significantly  reduced  because  of  the  additional  computations 
required  to  calculate  the  density  of  each  neighbour,  even  though  it  does  not  need  to  find 
neighbours  based  on  distance  calculations. 

Other k-NN-based anomaly detectors (e.g., [2,3]) have employed search space
pruning methods to reduce the cost of the nearest neighbour search and achieve near-linear
time complexity. DEMass can be applied in these algorithms in the same way as we have
done for LOF, without distance calculation, and the resultant algorithms will have better time
and space complexities.

Feature  Bagging  [23]  has  been  proposed  to  improve  the  detection  accuracy  of  LOF  by 
building  multiple  models,  where  each  model  is  constructed  from  a  subset  of  randomly  chosen 
features,  and  the  final  prediction  is  combined  by  averaging  the  anomaly  scores  from  individual 
models.  While  the  method  has  been  shown  to  produce  a  small  improvement  over  a  single 
model  LOF  in  terms  of  AUC,  it  requires  a  substantial  increase  in  runtime  (proportional  to 
the  number  of  models  required).  This  adds  to  the  high  computational  cost  of  LOF  we  have 
already  discussed  in  Sect.  6.2. 
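
The averaging scheme described above can be sketched as follows. The base scorer here is a toy stand-in (distance from the column-wise mean), not LOF, and the choice of drawing between d/2 and d - 1 features per model is our assumption about the subset size; all names are illustrative.

```python
import math
import random


def distance_to_mean_scorer(rows):
    """Toy stand-in for LOF: Euclidean distance from the column-wise mean."""
    means = [sum(column) / len(column) for column in zip(*rows)]
    return [
        math.sqrt(sum((value - m) ** 2 for value, m in zip(row, means)))
        for row in rows
    ]


def feature_bagging_scores(data, base_scorer, n_models, seed=0):
    """Average anomaly scores over models built on random feature subsets."""
    rng = random.Random(seed)
    d = len(data[0])
    totals = [0.0] * len(data)
    for _ in range(n_models):  # runtime grows in proportion to n_models
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        features = rng.sample(range(d), size)
        projected = [[row[f] for f in features] for row in data]
        for index, score in enumerate(base_scorer(projected)):
            totals[index] += score
    return [total / n_models for total in totals]
```

The loop makes the cost structure explicit: every additional model requires one more full run of the base detector, which is why the method multiplies the runtime of an already expensive detector.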

It is possible to avoid density estimation. For example, uLSIF (unconstrained least-squares
importance fitting) [17], OSVM (one-class support vector machine) [28] and SVDD (support
vector data description) [32] are anomaly detectors which do not employ density estimators.
uLSIF directly estimates the density ratio between the training set and the test set and uses
the density ratio as the anomaly score. OSVM and SVDD find the smallest region that covers the
majority of the normal instances and regard instances outside the region as anomalies. uLSIF
is found to perform comparably with LOF and to run significantly faster [17]. It would be
interesting to compare our density-based approach DEMass-LOF with these non-density-based
approaches to determine their relative strengths and weaknesses.

DEMass-Bayes is the first Bayesian classifier with constant training time complexity
in the number of training instances. It is well suited to big data and data streams,
where there are potentially infinite data. It is interesting to note that DEMass-Bayes has the
flexibility to trade off between accuracy and the computational cost as required. The time and
space required can be reduced by setting lower values for the parameters t and ψ if a higher
misclassification rate can be tolerated. These parameters can be set to higher values in order
to achieve better classification accuracy at the expense of higher computational cost. The
current implementation does not suit problems with small data sets because the estimation
relies on a sufficiently large sample in order to model accurately the local density distributions
of classes in the data space.

We also experimented with a version of DEMass-Bayes which samples ψ instances per
class, i.e., |𝒟_{i,y}| = ψ. With this formulation, the likelihood p(x|y) is estimated as follows:


(10) 


where T_{i,y}(·) is constructed from 𝒟_{i,y}, which is a subset of ψ samples from class y. This
formulation forces a uniform class distribution in building h:d-Trees. By sampling an equal
number of instances from every class, an instance in a smaller class is sampled more frequently
than one in a larger class. The ‘forced’ uniform class distribution distorts the
estimation of local class density in overlapping regions. The formulation discussed in Sect.
5.3 employs approximately the original class distribution. It provides a better estimation of
p(x|y) than Eq. (10) and runs significantly faster too.

DEMass  sets  a  new  benchmark  of  what  density-based  algorithms  can  achieve.  In  contrast 
to  the  density-based  approaches,  mass-based  approaches  [34,35]  solve  problems  without 
the  use  of  a  density  estimator.  Mass-based  approaches  have  been  shown  to  perform  better 
than  the  current  density-based  approaches  in  terms  of  time  and  space  complexities.  It  is  thus 
interesting  to  compare  the  new  benchmark  achieved  by  DEMass-density-based  approaches 
with  mass-based  approaches. 

The current implementation of DEMass has two limitations. First, it has step
subdivisions controlled by a global parameter h. The limited possible steps may be too coarse for
some applications, and the setting is not adaptive to local variations in density. Second, the
grid-based implementation carries all the limitations associated with grid-based approaches,
especially in dealing with high-dimensional problems. All these limitations can be overcome by
using a non-grid method which is adaptive to the local data distribution. This non-grid-based
implementation will eliminate one global parameter and potentially tackle high-dimensional
problems more effectively.


8  Conclusions  and  future  work 

The new density estimation method we introduced has two unique features which cannot
be found in existing density estimation methods. First, it is the first density estimator that
utilises no distance measures. Second, it has average-case sub-linear time complexity and
constant space complexity in the number of instances. Existing density estimators must use
a distance measure and have time and space complexities considerably worse than linear. The
time and space complexities achieved set a new benchmark for density-based algorithms, one
that was previously thought impossible.

The asymptotic analysis reveals that the new density estimator has the same characteristic
as KDE, i.e., both have a smoothing parameter used to trade off systematic error
(bias) against random error (variance).

Making  full  use  of  the  features  in  the  new  density  estimator,  we  show  that  two  current 
algorithms,  in  the  unsupervised  learning  setting  from  two  key  areas  of  data  mining,  can 
be  significantly  simplified  through  set-based  definitions  rather  than  the  current  point-based 
definitions.  This  has  directly  contributed  to  their  significantly  improved  time  complexities. 
In the supervised learning setting, DEMass enables direct estimation of multidimensional
likelihood $p(\mathbf{x}|y)$ for the first time, without any assumptions.

Our evaluation shows that the new density estimator not only successfully replaces existing
density estimators in three density-based algorithms, DBSCAN, LOF and Bayesian classifiers,
but (a) reduces the runtime of DBSCAN and LOF to become algorithms with sub-linear
time complexity and (b) scales down existing Bayesian classifiers’ best training time
complexity from linear to constant. In addition, DEMass-DBSCAN, DEMass-LOF and DEMass-Bayes
often achieve equivalent or better task-specific performances than DBSCAN, LOF and
existing Bayesian classifiers, respectively.

Our  result  implies  that  most,  if  not  all,  density-based  algorithms  can  reap  the  immediate 
benefit  of  significantly  lowering  their  time  complexities  by  simply  replacing  the  existing 
density  estimators  with  the  new  one,  with  a  potential  further  improvement  in  the  task-specific 
performance. 

Future work has three directions. First, we will apply the new density estimator in existing
algorithms in more areas. We will ascertain whether there are areas in which the new density
estimator cannot replace existing density estimators. Second, we will compare
DEMass-density-based approaches with mass-based approaches to determine their relative strengths
and weaknesses. Third, we will explore DEMass’s ability to deal with high-dimensional
problems.

Appendix: Data characteristic of the RingCurve-Wave-TriGaussian data set

The  characteristic  of  the  RingCurve-Wave-TriGaussian  data  set,  used  in  Sect.  5.1,  is  shown 
in  Fig.  10.  Each  of  the  Ring-Curve,  Wave  and  Triangular-Gaussian  is  a  two-dimensional  data 
set,  and  together,  there  is  a  total  of  seven  clusters.  Each  cluster  has  10,000  instances.  When 
used  in  the  scale-up  experiment,  the  data  size  in  each  cluster  was  scaled  by  a  factor  of  0.1,  1, 
75,  150  to  1,500. 


[Figure: scatter plots of (a) RingCurve, (b) Wave and (c) TriGaussian]

Fig. 10 Scatter plot of the clusters in the RingCurve-Wave-TriGaussian data set

Abstract. Existing generative classifiers (e.g., BayesNet and AnDE)
make independence assumptions and estimate one-dimensional likelihood.
This paper presents a new generative classifier called MassBayes
that estimates multi-dimensional likelihood without making any explicit
assumptions. It aggregates the multi-dimensional likelihoods estimated
from random subsets of the training data using varying size random
feature subsets. Our empirical evaluations show that MassBayes yields
better classification accuracy than the existing generative classifiers in
large data sets. As it works with fixed-size subsets of training data, it has
constant training time complexity and constant space complexity, and it
can easily scale up to very large data sets.

Keywords:  Generative  classifier,  Likelihood  estimation,  MassBayes. 


1  Introduction 

The learning task in classification is to learn a model from a labelled training set
that maps each instance to one of the predefined classes. The model learned is
then used to predict a class label for each unseen test instance. Each instance $\mathbf{x}$
is represented by a d-dimensional vector $(x_1, x_2, \ldots, x_d)$ and given a class label
$y \in \{y_1, y_2, \ldots, y_c\}$, where c is the total number of classes. The training set D is
a collection of labelled instances $\{(\mathbf{x}^{(i)}, y^{(i)})\}$ $(i = 1, 2, \ldots, N)$.

The generative approach of classifier learning models the joint distribution
$p(\mathbf{x}, y)$ and predicts the most probable class as:

$$\hat{y} = \arg\max_{y} \, p(\mathbf{x}, y) \tag{1}$$

Using the product rule, the joint probability can be factorised as:

$$p(\mathbf{x}, y) = p(y) \times p(\mathbf{x}|y) \tag{2}$$

Generative classifiers learn either the joint distribution $p(\mathbf{x}, y)$ or the likelihood
$p(\mathbf{x}|y)$. However, estimating $p(\mathbf{x}, y)$ or $p(\mathbf{x}|y)$ directly from data using existing
data modelling techniques is difficult. Density estimators such as Kernel Density
Estimation [1], k-Nearest Neighbour [1] and Density Estimation Trees [2] are
impractical in large data sets due to their high time and space complexities. The
research has thus focused on learning one-dimensional likelihood to approximate
$p(\mathbf{x}, y)$ in different ways.
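
The decision rule in Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function name `predict`, the toy prior and the toy likelihood table are assumptions made for the example and do not come from the paper.

```python
def predict(prior, likelihood, x):
    """Return the class y maximising p(y) * p(x|y), i.e. Eqs. (1) and (2)."""
    return max(prior, key=lambda y: prior[y] * likelihood(x, y))


# Hypothetical toy model: two classes and a single binary attribute.
prior = {"a": 0.7, "b": 0.3}
table = {("a", 0): 0.1, ("a", 1): 0.9, ("b", 0): 0.8, ("b", 1): 0.2}


def toy_likelihood(x, y):
    """Look up p(x|y) in the made-up table above."""
    return table[(y, x)]


print(predict(prior, toy_likelihood, 0))  # class with the larger p(y) * p(x|y)
```

Any estimator of the likelihood can be passed in as `likelihood`; the classifiers discussed in this paper differ only in how that quantity is estimated.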


J.  Pei  et  al.  (Eds.):  PAKDD  2013,  Part  I,  LNAI  7818,  pp.  136-148,  2013. 

Existing generative classifiers allow limited probabilistic dependencies among
attributes and assume some kind of conditional independence. Different generative
classifiers make different assumptions and allow different levels of dependencies.
They learn a network (or its simplification) of probabilistic relationship
between the attributes and estimate the likelihood at each node given its parents
from D (i.e., one-dimensional likelihood estimation). The joint distribution
$p(\mathbf{x}, y)$ is estimated as the product of likelihood of each attribute given their
parents in the network:

$$p(\mathbf{x}, y) = p(x_1|\pi_1) \times p(x_2|\pi_2) \times \cdots \times p(x_d|\pi_d) \times p(y|\pi_y) \tag{3}$$

where $\pi_i$ is $parent(x_i)$ and $\pi_y$ is $parent(y)$.

Though these one-dimensional likelihood generative classifiers have been
shown to perform well [3, 4, 5, 6, 7], we hypothesize that a multi-dimensional
likelihood generative classifier will produce even better results.

In this paper, we propose an ensemble approach to estimate multi-dimensional
likelihood without making any explicit assumption about attribute independence.
The idea is to construct an ensemble of t multi-dimensional likelihood
estimators using random sub-samples $\mathcal{D}_i \subset D$ $(i = 1, 2, \ldots, t)$. Each estimator
estimates the multi-dimensional likelihood using a random subset of d attributes
from $\mathcal{D}_i$. The average estimation from t estimators provides a good approximation
of $p(\mathbf{x}|y)$. We call the resulting generative classifier MassBayes. It has constant
space complexity and constant training time complexity because it employs
a fixed-size training subset to build each of the t estimators.

The  rest  of  the  paper  is  structured  as  follows.  Section  2  provides  a  brief 
overview  of  well-known  generative  classifiers.  The  proposed  method  is  described 
in  Section  3  followed  by  the  implementation  details  in  Section  4.  The  empirical 
evaluation  results  are  presented  in  Section  5.  Finally,  we  provide  conclusions  and 
directions  for  future  research  in  Section  6. 


2  Existing  Generative  Classifiers 

Naive Bayes (NB) [3] is the simplest generative approach that estimates $p(\mathbf{x}, y)$
by assuming that the attributes are statistically independent given y:

$$p(\mathbf{x}, y)_{NB} = p(y) \prod_{i=1}^{d} p(x_i|y) \tag{4}$$

Despite the strong independence assumption, it has been shown that NB produces
impressive results in many application domains [3,4]. Its simplicity and
clear probabilistic semantics have motivated researchers to explore different
extensions of NB to improve its performance by relaxing the unrealistic assumption.
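
As a rough illustration of Equation (4), the sketch below estimates the NB joint probability from raw relative frequencies over discrete attributes. The helper names `fit_nb` and `joint_nb` and the toy data are assumptions for this example; no smoothing is applied, which a practical implementation would normally add.

```python
from collections import Counter, defaultdict


def fit_nb(X, y):
    """Collect class counts and per-class attribute-value counts."""
    class_count = Counter(y)
    value_count = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            value_count[(i, label)][v] += 1
    return class_count, value_count


def joint_nb(x, label, class_count, value_count):
    """p(x, y)_NB = p(y) * prod_i p(x_i | y), using unsmoothed frequencies."""
    n = sum(class_count.values())
    p = class_count[label] / n
    for i, v in enumerate(x):
        p *= value_count[(i, label)][v] / class_count[label]
    return p


# Hypothetical toy data: four instances, two binary attributes, two classes.
X = [(0, 0), (0, 1), (1, 1), (1, 1)]
y = ["a", "a", "b", "b"]
class_count, value_count = fit_nb(X, y)
print(joint_nb((0, 1), "a", class_count, value_count))
```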

BayesNet [5] learns a network of probabilistic relationship among the attributes
including the class attribute from the training data. Each node in the
network is independent of its non-descendants given the state of its parents. At
each node, the conditional probabilities with respect to its parents are learned
from D. The joint probability $p(\mathbf{x}, y)$ is estimated as:

$$p(\mathbf{x}, y)_{BayesNet} = p(y|\pi_y) \prod_{i=1}^{d} p(x_i|\pi_i) \tag{5}$$

Learning  an  optimal  network  requires  searching  over  a  set  of  every  possible 
network,  which  is  exponential  in  d.  It  is  intractable  in  high-dimensional  problems 
[8].  NB  is  the  simplest  form  of  a  Bayesian  network,  where  each  attribute  is 
dependent  on  y  only. 

In another simplification of BayesNet, AnDE [7] relaxes the independence
assumption by allowing dependency between y and a fixed number of privileged
attributes or super-parents. The other attributes are assumed to be independent
given the n super-parents and y. AnDE with n = 0, A0DE, is NB. AnDE avoids
the expensive searching in learning probabilistic dependencies by constructing an
ensemble of n-dependence estimators. The joint probability $p(\mathbf{x}, y)$ is estimated as:

$$p(\mathbf{x}, y)_{AnDE} = \sum_{s \in S_n} p(\mathbf{x}_s, y) \prod_{j \in \{1,2,\ldots,d\} \setminus s} p(x_j|\mathbf{x}_s, y) \tag{6}$$

where $S_n$ is the collection of all subsets of size n of the set of d attributes
$\{1, 2, \ldots, d\}$; and $\mathbf{x}_s$ is an n-dimensional vector of values of $\mathbf{x}$ defined by s.
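
Since the sum in AnDE ranges over every size-n subset of the d attributes, the number of sub-models is the binomial coefficient "d choose n". The short sketch below (the function name `ensemble_size` is an assumption for this example) shows how quickly that count grows for a 50-dimensional problem such as MiniBooNE.

```python
from math import comb


def ensemble_size(d, n):
    """Number of size-n attribute subsets, i.e. the number of terms in the AnDE sum."""
    return comb(d, n)


for n in (0, 1, 2, 3):
    print(n, ensemble_size(50, n))
```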

It has been shown that A1DE and A2DE produce better predictive accuracy
than the other state-of-the-art generative classifiers [6,7]. However, AnDE only allows
dependencies on a fixed number of attributes and y. Because of the high time
complexity of $O\left(N\binom{d}{n+1}\right)$¹ and space complexity of $O\left(c\binom{d}{n+1}v^{n+1}\right)$, where
v is the average number of values for an attribute [7], only A2DE or A3DE is
feasible even for a moderate number of dimensions. Furthermore, selecting an
appropriate value of n for a particular data set requires a search.

AnDE  and  many  other  implementations  of  BayesNet  require  all  the  attributes 
to  be  discrete.  The  continuous-valued  attributes  must  be  discretised  using  a 
discretisation  method  before  building  a  classifier. 

3  MassBayes:  A  New  Generative  Classifier 

Rather than aggregating an ensemble of n-dependence single-dimensional likelihood
estimators, we propose to aggregate an ensemble of t multi-dimensional
likelihood estimators where each likelihood is estimated using different random
subsets of d attributes from data. The likelihood $p(\mathbf{x}|y)$ is estimated as:

$$p(\mathbf{x}|y) = \frac{1}{t} \sum_{g \in G_t} p(\mathbf{x}_g|y) \tag{7}$$

where $G_t$ is a collection of t subsets of varying sizes of d attributes; and $\mathbf{x}_g$ is a
$|g|$-dimensional vector of values of $\mathbf{x}$ defined by g; and $1 \le |g| \le d$.

¹ $\binom{d}{n}$ is a binomial coefficient of n out of d.

Each $p(\mathbf{x}_g|y)$ is estimated using a random subset of training instances $\mathcal{D} \subset D$,
where $|\mathcal{D}| = \psi < N$:

$$p(\mathbf{x}_g|y) = \frac{|\mathcal{D}_{y,\mathbf{x}_g}|}{|\mathcal{D}_y|} \tag{8}$$

where $|\mathcal{D}_{y,\mathbf{x}_g}|$ is the number of instances having attribute values $\mathbf{x}_g$ belonging to
class y in $\mathcal{D}$ and $|\mathcal{D}_y|$ is the number of instances belonging to class y in $\mathcal{D}$.
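
The count ratio just described can be sketched as follows. The function `region_likelihood`, its `in_region` predicate and the toy sample are assumptions for illustration; returning zero when the sub-sample contains no instance of the class is also an assumption, as the text does not say how that case is handled.

```python
def region_likelihood(sample, labels, in_region, y):
    """Estimate p(x_g | y) as (# class-y instances in the region) / (# class-y instances)."""
    n_y = sum(1 for label in labels if label == y)
    if n_y == 0:
        return 0.0  # assumption: an undefined ratio is treated as zero
    n_y_region = sum(
        1 for row, label in zip(sample, labels) if label == y and in_region(row)
    )
    return n_y_region / n_y


# Hypothetical one-dimensional sub-sample; the region is the half-line x < 0.5.
sample = [(0.1,), (0.2,), (0.8,), (0.9,)]
labels = ["a", "a", "a", "b"]


def in_region(row):
    return row[0] < 0.5


print(region_likelihood(sample, labels, in_region, "a"))
```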

Rather than relying on a specific discretisation method in the preprocessing
step, we propose to build a model directly from data, akin to an adaptive
multi-dimensional histogram, to determine $\mathbf{x}_g$ which adapts to the local data
distribution. The feature space partitioning we employed (to be discussed in
Section 4) produces large regions in sparse areas and small regions in the dense
areas of the data distribution.

Let $T(\cdot)$ be a function that divides the feature space into non-overlapping
regions and $T(\mathbf{x})$ be the region where $\mathbf{x}$ falls. In a multi-dimensional space, each
instance in $\mathcal{D}$ can be isolated by splitting only on a few dimensions, i.e., only a
subset of d attributes ($g \subseteq \{1, 2, \ldots, d\}$) is used to define $T(\mathbf{x})$. Hence, $|\mathcal{D}_{y,\mathbf{x}_g}|$ is
the number of instances belonging to class y in the region $T(\mathbf{x})$. Let $p(T(\mathbf{x})|y)$ be
the probability of region $T(\mathbf{x})$ when only class y instances in $\mathcal{D}$ are considered.

$$p(T(\mathbf{x})|y) = p(\mathbf{x}_g|y) = \frac{|\mathcal{D}_{y,\mathbf{x}_g}|}{|\mathcal{D}_y|} \tag{9}$$

The new generative classifier, called MassBayes, estimates the joint distribution
as:

$$p(\mathbf{x}, y)_{MassBayes} = p(y) \frac{1}{t} \sum_{g \in G_t} p(\mathbf{x}_g|y) = p(y) \frac{1}{t} \sum_{i=1}^{t} p(T_i(\mathbf{x})|y) \tag{10}$$

Fig. 1. Different regions from different $T_i(\cdot)$ $(i = 1, 2, \ldots, 5)$ that cover $\mathbf{x}$


The average probability of t different regions $T_i(\mathbf{x})$ $(i = 1, 2, \ldots, t)$, constructed
using $\mathcal{D}_i \subset D$, provides a good estimate for $p(\mathbf{x}|y)$ as it estimates
the multi-dimensional likelihood by considering the distribution in different local
neighbourhoods of $\mathbf{x}$ in the data space. An illustrative example is provided in
Figure 1. Note that the estimator employed in MassBayes is not a true density
estimator as it does not integrate to 1.
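
A minimal sketch of the averaging step in Equation (10) is given below. It assumes the per-region probabilities have already been obtained from the t estimators; the names `massbayes_joint` and `classify` and the numbers used are hypothetical.

```python
def massbayes_joint(prior_y, region_probs):
    """p(x, y) = p(y) * (1/t) * sum_i p(T_i(x) | y), as in Eq. (10)."""
    return prior_y * sum(region_probs) / len(region_probs)


def classify(priors, region_probs_by_class):
    """Pick the class with the largest averaged joint estimate."""
    return max(
        priors,
        key=lambda y: massbayes_joint(priors[y], region_probs_by_class[y]),
    )


# Hypothetical values for t = 3 estimators and two classes.
priors = {"a": 0.5, "b": 0.5}
region_probs_by_class = {"a": [0.2, 0.4, 0.0], "b": [0.1, 0.1, 0.1]}
print(classify(priors, region_probs_by_class))
```

Note that a single estimator returning zero does not force the averaged estimate to zero, which is one practical difference between averaging and taking a product of estimates.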

MassBayes has the following characteristics in comparison with AnDE:

1. In each estimator, AnDE estimates one-dimensional likelihood given a
fixed number of super-parents and y, whereas MassBayes estimates
multi-dimensional likelihood using a varying number of dimensions.

2. In AnDE, the ensemble size is fixed to $\binom{d}{n}$. But, MassBayes allows the
flexibility for users to set the ensemble size.

3. AnDE requires continuous-valued attributes to be discretised before building
the model. The performance of AnDE is affected by the discretisation
technique used. In contrast, MassBayes builds models directly from data.
It can be viewed as a dynamic multi-dimensional discretisation where the
information loss is minimised by averaging over multiple models.

4. Each model in MassBayes is built with a training subset of size $\psi < N$ which
gives rise to the constant training time. In contrast, each model in AnDE is
trained using the entire training set.

5. AnDE is a deterministic algorithm whereas MassBayes is a randomised
algorithm.

6.  Like  AnDE,  MassBayes  is  a  generative  classifier  without  search. 

4  Implementation 

In order to partition the feature space to define the regions $T_i(\cdot)$, we use the
implementation described by Ting and Wells (2010) using a binary tree called
h:d-tree [9]. A parameter h defines the maximum level of sub-division. The
maximum height of a tree is h × d.

Let the data space that covers the instances in $\mathcal{D}$ be $\Delta$. The data space
$\Delta$ is adjusted to become $\delta$ using a random perturbation conducted as follows.
For each dimension j, a split point $v_j$ is chosen randomly within the range
$[\min_j(\Delta), \max_j(\Delta)]$. Then, the new range $\delta_j$ along dimension j is defined as
$[v_j - r, v_j + r]$, where $r = \max(v_j - \min_j(\Delta), \max_j(\Delta) - v_j)$. The new range on
all dimensions defines the adjusted work space for the tree building process.
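
The perturbation of the work space can be sketched as below. The function name `perturb_workspace` is an assumption, and so is drawing the split point from a uniform distribution, since the text only says it is chosen randomly.

```python
import random


def perturb_workspace(mins, maxs, rng=random):
    """Return (new_mins, new_maxs) of the adjusted work space, one range per dimension."""
    new_mins, new_maxs = [], []
    for lo, hi in zip(mins, maxs):
        v = rng.uniform(lo, hi)   # random split point (uniform draw is an assumption)
        r = max(v - lo, hi - v)   # half-width large enough to still cover [lo, hi]
        new_mins.append(v - r)
        new_maxs.append(v + r)
    return new_mins, new_maxs


mins, maxs = [0.0, -1.0], [1.0, 3.0]
new_mins, new_maxs = perturb_workspace(mins, maxs, random.Random(1))
print(new_mins, new_maxs)
```

By construction the adjusted range always contains the original range and is at most twice as wide, so the first split, at the midpoint of the adjusted range, lands at the random point chosen for that dimension.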

A subset $\mathcal{D}$ is constructed from D by sampling $\psi$ instances without replacement.
The sampling process is restarted with D when all the instances are used.
The random adjustment of the work space and random sub-sampling, as
described earlier, ensure that no two trees are identical.
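
One way to read this sampling scheme is sketched below: indices are drawn from a shuffled pool shared across trees, and the pool is refilled from the full data set when it runs out. The generator name `subsample_indices` is hypothetical, and restarting as soon as fewer than the requested number of indices remain is an assumption, since the text does not specify how a partially used pool is handled.

```python
import random


def subsample_indices(n, psi, t, rng=random):
    """Yield t index lists of size psi, drawn without replacement from a shared pool."""
    if psi > n:
        raise ValueError("psi must not exceed the number of instances")
    pool = []
    for _ in range(t):
        if len(pool) < psi:       # assumption: restart when too few indices remain
            pool = list(range(n))
            rng.shuffle(pool)
        yield [pool.pop() for _ in range(psi)]


subs = list(subsample_indices(10, 4, 5, random.Random(0)))
print(subs)
```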

The  dimension  to  split  is  selected  from  a  randomised  set  of  d  dimensions  in 
a  round-robin  manner  at  each  level  of  a  tree.  A  tree  is  constructed  by  splitting 
the  work  space  into  two  equal-volume  half  spaces  at  each  level.  The  process 
is  then  repeated  recursively  on  each  non-empty  half-space.  The  tree  building 
process  stops  when  there  is  only  one  instance  in  a  node  or  the  maximum  height 
is  reached. 

At the leaf node, the number of instances in the node belonging to each class
is stored. Figure 2 shows a typical example of an implementation of $T(\cdot)$ as an
h:d-tree for h = 2 and d = 2. The dotted lines enclose the instances in $\mathcal{D}$ and
the solid lines enclose the adjusted work space which has ranges $\delta_1$ and $\delta_2$ on the
$x_1$ and $x_2$ dimensions. R1, R2, R3, R4 and R5 represent different regions in $T(\cdot)$
depending on the data distribution in $\mathcal{D}$. Region R1 is defined by splitting the
work space in the $x_1$ dimension only, g = {1}, whereas the other four regions use
dimensions $x_1$ and $x_2$, i.e., g = {1, 2}.

Fig. 2. An example of an h:d-tree for h = 2 and d = 2

In the original implementation by Ting and Wells (2010) for mass estimation,
each tree is built to the maximum height of h × d resulting in equal-size regions
regardless of the data distribution [9]. In our implementation, in order to adapt
to the data distribution, the tree building stops early once the instances are
separated. We use the same algorithm as used by Ting and Wells (2010) to
generate h:d-trees to represent $T_i(\cdot)$ in [9] with the required modification.

The  procedures  to  generate  t  trees  from  a  given  data  set  D  are  provided  in 
Algorithms  1  and  2. 

The maximum height of each tree is hd, and $\psi$ instances have to be assigned
to either of the two child nodes at each level of a tree. Hence, the total training
time complexity to construct t trees is $O(thd\psi)$. There are a maximum of $\psi$
(as $\psi < 2^{hd}$ in general) leaf nodes in each tree. The total space complexity is
$O(t(d + c)\psi)$.
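
To make the bounds concrete, the sketch below evaluates the products inside $O(thd\psi)$ and $O(t(d+c)\psi)$ for given parameter values, with constants ignored. The function name `massbayes_bounds` and the chosen values (the defaults t = 100, h = 10 and a sample size of 5000, with d = 10 and c = 7 as for CoverType) are used purely for illustration; the point is that N does not appear anywhere.

```python
def massbayes_bounds(t, h, d, psi, c):
    """Evaluate the terms inside the stated big-O bounds (constants ignored)."""
    return {
        "train_ops": t * h * d * psi,          # O(t h d psi)
        "max_leaves_per_tree": min(psi, 2 ** (h * d)),
        "space_units": t * (d + c) * psi,      # O(t (d + c) psi)
    }


print(massbayes_bounds(t=100, h=10, d=10, psi=5000, c=7))
```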

The time and space complexities of two variants of NB (NB-KDE that estimates
$p(x_i|y)$ through kernel density estimation [4]; and NB-Disc that estimates
$p(x_i|y)$ through discretisation [10]), AnDE and MassBayes are presented in
Table 1. Both training time complexity and space complexity of MassBayes are
independent of N. Note that the complexities for NB-Disc and AnDE do not
include the additional discretisation needed in the preprocessing.


Table 1. Time and space complexities of different generative classifiers

Classifiers    Training time            Testing time            Space
NB-KDE [4]     $O(Nd)$                  $O(cmd)$                $O(cmd)$
NB-Disc [6]    $O(Nd)$                  $O(cd)$                 $O(cdv)$
AnDE [7]       $O(N\binom{d}{n+1})$     $O(cd\binom{d}{n})$     $O(c\binom{d}{n+1}v^{n+1})$
MassBayes      $O(thd\psi)$             $O(thd)$                $O(t(d+c)\psi)$

N: total number of training instances, m: average number of training instances in a
class, d: number of dimensions, c: number of classes, v: average number of discrete values
of an attribute, n: number of super-parents, t: number of trees, h: level of divisions,
and $\psi$: sample size.


Algorithm 1. BuildTrees(D, t, $\psi$, h)

Inputs: D - input data, t - number of trees, $\psi$ - sub-sampling size, h - number of
times an attribute is employed in a path.

Output: F - a set of t h:d-trees

H ← h × d  {Maximum height of a tree}
Initialize F
for i = 1 to t do
    $\mathcal{D}$ ← sample(D, $\psi$)  {strictly without replacement}
    (min, max) ← InitialiseWorkspace($\mathcal{D}$)
    A ← {Randomised list of d attributes.}
    F ← F ∪ SingleTree($\mathcal{D}$, min, max, 0, A)
end for
return F


Algorithm 2. SingleTree($\mathcal{D}$, min, max, ℓ, A)

Inputs: $\mathcal{D}$ - input data, min & max - arrays of minimum and maximum values for
each attribute that define a work space, A - a randomised list of d attributes, ℓ -
current height level.

Output: an h:d-tree

Initialize Node(·)
while (ℓ < H and |$\mathcal{D}$| > 1) do
    q ← nextAttribute(A, ℓ)  {Retrieve an attribute from A based on height level.}
    mid_q ← (max_q + min_q)/2
    $\mathcal{D}_l$ ← filter($\mathcal{D}$, q < mid_q)
    $\mathcal{D}_r$ ← filter($\mathcal{D}$, q ≥ mid_q)
    if (|$\mathcal{D}_l$| = 0) or (|$\mathcal{D}_r$| = 0) then  {Reduce range for single-branch node.}
        if (|$\mathcal{D}_l$| > 0) then max_q ← mid_q
        else min_q ← mid_q
        end if
        ℓ ← ℓ + 1
        continue at the start of while loop
    end if
    {Build two nodes: Left and Right as a result of a split into two half-spaces.}
    temp ← max_q; max_q ← mid_q
    Left ← SingleTree($\mathcal{D}_l$, min, max, ℓ + 1, A)
    max_q ← temp; min_q ← mid_q
    Right ← SingleTree($\mathcal{D}_r$, min, max, ℓ + 1, A)
    terminate while loop
end while
classCount ← updateClassCount($\mathcal{D}$)
return Node(Left, Right, SplitAtt ← q, SplitValue ← mid_q, classCount)
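
The two algorithms can be approximated in Python as shown below. This is a simplified sketch rather than the authors' Java/WEKA implementation: the dictionary-based node layout and the function names `single_tree` and `build_trees` are assumptions, the random perturbation of the work space is omitted (the plain bounding box of the sub-sample is used), each tree draws a fresh sub-sample instead of sharing a pool, and class counts are stored at every node rather than only at leaves.

```python
import random
from collections import Counter


def single_tree(rows, labels, mins, maxs, level, attrs, max_height):
    """Halve the work space recursively until instances separate or the height cap is hit."""
    mins, maxs = list(mins), list(maxs)          # local copies of the work space
    while level < max_height and len(rows) > 1:
        q = attrs[level % len(attrs)]            # round-robin attribute choice
        mid = (mins[q] + maxs[q]) / 2
        left = [i for i, r in enumerate(rows) if r[q] < mid]
        right = [i for i, r in enumerate(rows) if r[q] >= mid]
        if not left or not right:                # single branch: shrink range, go deeper
            if left:
                maxs[q] = mid
            else:
                mins[q] = mid
            level += 1
            continue
        left_maxs, right_mins = list(maxs), list(mins)
        left_maxs[q] = right_mins[q] = mid
        return {
            "att": q, "value": mid, "counts": Counter(labels),
            "left": single_tree([rows[i] for i in left], [labels[i] for i in left],
                                mins, left_maxs, level + 1, attrs, max_height),
            "right": single_tree([rows[i] for i in right], [labels[i] for i in right],
                                 right_mins, maxs, level + 1, attrs, max_height),
        }
    return {"counts": Counter(labels)}           # leaf: per-class instance counts


def build_trees(X, y, t, psi, h, seed=0):
    """Build t trees, each from psi instances sampled without replacement."""
    rng = random.Random(seed)
    d = len(X[0])
    trees = []
    for _ in range(t):
        idx = rng.sample(range(len(X)), psi)
        rows, labels = [X[i] for i in idx], [y[i] for i in idx]
        mins = [min(r[j] for r in rows) for j in range(d)]
        maxs = [max(r[j] for r in rows) for j in range(d)]
        attrs = rng.sample(range(d), d)          # randomised attribute order
        trees.append(single_tree(rows, labels, mins, maxs, 0, attrs, h * d))
    return trees


# Hypothetical toy data: four two-dimensional instances in two classes.
X = [(0.1, 0.1), (0.2, 0.8), (0.8, 0.2), (0.9, 0.9)]
y = ["a", "a", "b", "b"]
trees = build_trees(X, y, t=3, psi=4, h=2)
```

Because each call works on at most the sub-sample size, the cost of building the ensemble does not depend on the size of the full data set.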

5 Empirical Evaluation

This section presents the results of the experiments conducted to evaluate the
performance of MassBayes against seven well known contenders: two variants of
NB (NB-KDE and NB-Disc), BayesNet, three variants of AnDE (A1DE, A2DE,
A3DE) and decision tree J48 (i.e., the WEKA [11] version of C4.5 [12]).

MassBayes was implemented in Java using the WEKA platform [11] which also
has implementations of NB, BayesNet, A1DE and J48. For A2DE and A3DE,
we used the WEKA implementations provided by the authors of AnDE.

All the experiments were conducted using a 10-fold cross validation in a Linux
machine with 2.27 GHz processor and 100 GB memory. The average accuracy
(%) and the average runtime (seconds) over a 10-fold cross validation were
reported. A two-standard-error significance test was conducted to check whether
the difference in accuracies of two classifiers was significant. A win or loss was
counted if the difference was significant; otherwise, it was a draw.
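
The text does not spell out how the two-standard-error test was computed. The sketch below shows one plausible reading, stated as an assumption: the per-fold accuracy differences of two classifiers are treated as paired, and the mean difference counts as significant when its magnitude exceeds twice its standard error. The function name `compare` and the fold accuracies are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def compare(acc_a, acc_b):
    """Return 'win', 'loss' or 'draw' for classifier A against B over paired folds."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    se = stdev(diffs) / sqrt(len(diffs))   # standard error of the mean difference
    if abs(mean(diffs)) <= 2 * se:
        return "draw"
    return "win" if mean(diffs) > 0 else "loss"


# Made-up per-fold accuracies (%), for illustration only.
folds_a = [90, 91, 92, 91, 90]
folds_b = [80, 82, 81, 80, 81]
print(compare(folds_a, folds_b))
```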

Ten data sets with N > 10000 were used. All the attributes in the data sets are
numeric. The properties of the data sets are provided in Table 2. The RingCurve,
Wave and OneBig data sets were three synthetic data sets and the rest were
real-world data sets from UCI Machine Learning Repository [13]. RingCurve
and Wave are subsets of the RingCurve-Wave-TriGaussian data set used in [9]
and OneBig is the data set used in [14].


Table 2. Properties of the data sets used

Data sets     #N       #d   #c      Data sets     #N      #d   #c
CoverType     581012   10   7       RingCurve     20000   2    2
MiniBooNE     129596   50   2       Letters       20000   16   26
OneBig        68000    20   10      Magic04       19020   10   2
Shuttle       58000    8    7       Mamograph     11183   6    2
Wave          20000    2    2       Pendigits     10992   16   10

For AnDE, BayesNet and NB-Disc, data sets were discretised by a supervised
discretisation technique based on minimum entropy [15] as suggested by the
authors of AnDE before building the classification models.

Two variants of MassBayes were used: MassBayes with ($\psi$ = 5000) and
MassBayes' ($\psi$ = N). The other two parameters t and h were set as default
to 100 and 10, respectively.

For BayesNet, the parameter ‘maximum number of parents’ was set to 100
to examine whether a large number of parents produces better results; and the
parameter ‘initialise as Naive Bayes’ was set to ‘false’ to initialise an empty
network structure. The default values were used for the rest of the parameters.
All  the  other  classifiers  were  executed  with  the  default  parameter  settings. 

Table 3. Average classification accuracies (%) over a 10-fold cross validation

Data sets    MassBayes'  MassBayes  A3DE    A2DE    A1DE    BayesNet  NB-KDE  NB-Disc  J48
CoverType    94.00       78.21      88.16   80.81   72.89   87.79     66.72   66.61    92.39
MiniBooNE    92.68       91.11      N/A*    91.48   89.58   90.25     86.07   86.29    90.47
OneBig       100.00      100.00     N/A*    99.81   99.69   99.99     99.98   99.97    99.84
Shuttle      99.89       99.89      99.94   99.94   99.85   99.93     92.68   94.36    99.97
Letters      96.63       95.63      95.11   94.31   88.81   86.97     74.21   73.94    87.92
RingCurve    100.00      100.00     99.99   99.99   99.99   99.99     99.27   99.48    99.91
Wave         100.00      100.00     78.51   78.51   78.51   78.51     77.91   78.51    99.79
Magic04      85.72       85.53      85.08   84.57   83.00   83.46     76.13   78.27    85.01
Mamograph    98.69       98.71      98.51   98.37   98.42   98.54     97.86   97.62    98.57
Pendigits    99.45       99.28      98.80   98.82   97.84   96.81     88.64   87.90    96.56

* Did not complete because of integer overflow error.


Table 4. Win:Loss:Draw counts of MassBayes over the other contenders in terms of
classification accuracy based on the two-standard-error significance test

             A3DE    A2DE    A1DE    BayesNet  NB-KDE  NB-Disc  J48
MassBayes'   4:1:3   7:1:2   7:0:3   7:1:2     10:0:0  10:0:0   7:1:2
MassBayes    3:2:3   6:3:1   7:0:3   6:2:2     10:0:0  10:0:0   6:2:2


Table 5. Average runtime (seconds) over a 10-fold cross validation

Data sets    MassBayes'  MassBayes  A3DE    A2DE    A1DE    BayesNet  NB-KDE  NB-Disc  J48
CoverType    1075.8      45.7       45.6    13.9    4.9     387.9     96.3    3.2      3690.7
MiniBooNE    431.1       33.7       N/A     231.3   5.9     308.9     831.6   2.1      323.8
OneBig       113.9       10.5       N/A     11.6    3.9     432.5     253.0   0.8      15.1
Shuttle      48.5        8.0        1.8     0.7     0.5     6.8       1.5     0.4      4.2
Letters      18.9        5.5        11.5    2.6     0.8     4.9       2.5     0.4      7.3
RingCurve    4.4         2.3        0.2     0.2     0.2     0.3       2.4     0.2      0.4
Wave         4.9         2.1        0.2     0.2     0.2     0.2       2.5     0.1      0.6
Magic04      10.9        3.9        0.7     0.5     0.2     0.7       8.8     0.2      3.4
Mamograph    4.7         3.1        0.3     0.2     0.2     0.3       0.5     0.2      0.4
Pendigits    7.3         3.6        5.7     0.9     0.4     1.8       1.6     0.2      1.2


Table 3 shows the average classification accuracies of MassBayes' and MassBayes
in comparison to the other contenders. The results of the two-standard-error
significance test in Table 4 show that both MassBayes' and MassBayes
produced better classification accuracy than the other contenders in most data
sets.

[Figure: (a) Runtime, (b) Space]

Fig. 3. Scale-up test: MassBayes versus existing generative classifiers. The base for
training size ratio is 7000 instances and the bases for runtime ratio and memory ratio
are the training time and memory required to save a classification model for 7000
instances. Axes are on logarithmic scales of base 10.


MassBayes produced slightly poorer results than A2DE, A3DE, BayesNet and
J48 in CoverType. This was because the default sample size was not enough to
yield a good estimate. The accuracy increased to 84.62% with $\psi$ = 20000
and 88.66% with $\psi$ = 50000. More samples are required to grow the trees further
to model the distributions well if the class distributions in the feature space
are complex. Figure 4(a) shows the improvement in accuracy of MassBayes in
CoverType when the sample size was increased.

Table 5 presents the average runtime. In terms of runtime, MassBayes was
an order of magnitude faster than A2DE in MiniBooNE; BayesNet in CoverType,
MiniBooNE and OneBig; NB-KDE in MiniBooNE and OneBig; and J48
in CoverType and MiniBooNE. It was of the same order of magnitude as A3DE,
A2DE, BayesNet, NB-KDE and J48 in many cases and an order of magnitude
slower than NB-Disc and A1DE. MassBayes' was an order of magnitude slower
than the other contenders in many data sets. However, it was of the same order
of magnitude as A3DE in Letters; A2DE in MiniBooNE; BayesNet and NB-KDE
in MiniBooNE and OneBig; and J48 in CoverType and MiniBooNE.

Note that the reported runtime results for AnDE, BayesNet and NB-Disc
did not include the discretisation time that must be spent as a preprocessing
step, which gives the existing generative classifiers (except NB-KDE) an unfair
advantage over MassBayes. The discretisation time can be substantial in
large data sets. For example, the discretisation took 52 seconds in the largest
data set, CoverType. This discretisation time alone was more than the total
runtime of MassBayes. Thus, MassBayes in effect runs faster than all existing
generative classifiers on equal footing.

In order to examine the scalability of the classifiers in terms of training time
and space requirements with the increase in training size N, we used the
48-dimensional (42 irrelevant attributes with constant values) RingCurve-Wave-TriGaussian
data set previously employed by Ting and Wells (2010) in [9]. The
training data size was increased from 7000 to 70000, half-a-million, 1 million and
10 million by a factor of 1, 10, 75, 150 and 1500, respectively. Figure 3 shows the
increase in classification model building time and memory space required to store
the classification model for different generative classifiers. Note that the
discretisation time was not included in the presented results. The discretisation time
increases linearly with the increase in training data size. This additional time for
discretisation will increase the training time of AnDE, BayesNet and NB-Disc.
MassBayes had constant training time and constant space requirements.

Fig. 4. Effect of parameters $\psi$ and t on the classification accuracy and runtime of
MassBayes in the CoverType data set. The base for the runtime ratio while varying
$\psi$ and t is the total runtime (training and testing over a 10-fold cross validation) for
$\psi$ = 500 and t = 10, respectively. The horizontal axis of t and the vertical axis of
runtime ratio in (b) are on logarithmic scales of base 10.

In order to examine the sensitivity of the parameters $\psi$, t and h in classification
accuracy and runtime of MassBayes, we conducted a set of experiments by
varying one parameter and fixing the other two to the default values. The result
of the experiment varying $\psi$ and t in the largest data set (CoverType) is shown
in Figure 4. The increase in runtime was plotted as a ratio to show the factor of
runtime increased when the parameters were increased.

In general, accuracy increased up to a certain point and remained flat when
each of the three parameters was increased. This indicates that the parameters
of MassBayes are not too sensitive in terms of classification accuracy if they are
set to sufficiently high values. The runtime increased linearly with t and
sub-linearly with $\psi$. With fixed sample size ($\psi$ = 5000), increase in h after a certain
point did not affect the runtime because the tree building process stopped before
reaching the maximum level h once the instances were separated.

6  Conclusions  and  Future  Work 

In this paper, we presented a new generative classifier called MassBayes that
approximates p(x|y) by aggregating multi-dimensional likelihoods estimated using
varying size subsets of features from random subsets of training data. In
contrast, existing generative classifiers make assumptions about attribute independence
and estimate single-dimensional likelihood only. Our empirical results
show that MassBayes produced better classification accuracy than the existing
generative classifiers in large data sets.

In  terms  of  runtime,  it  scales  better  than  the  existing  generative  classifiers  in 
large  data  sets  as  it  builds  models  in  an  ensemble  using  fixed-size  data  subsets. 
The  constant  training  time  and  space  complexities  make  it  an  ideal  classifier  for 
large  data  sets  and  data  streams. 

Future  work  includes  applying  the  proposed  method  in  data  sets  with  discrete 
and  mixed  attributes  and  investigating  the  effectiveness  of  MassBayes  in  the  data 
stream context. In this paper, we have rigorously assessed MassBayes against the
state-of-the-art Bayesian classifiers. In the near future, we will assess its performance
against some well-known discriminative classifiers and their ensembles.
The  feature  space  partitioning  can  be  implemented  in  various  ways.  It  would 
be  interesting  to  investigate  a  more  intelligent  way  of  feature  space  partitioning 
rather  than  dividing  at  mid-point  of  a  randomly  selected  dimension. 
Abstract

Despite their widespread use, nearest neighbour density estimators have two fundamental limitations: $O(n^2)$
time complexity and $O(n)$ space complexity. Both limitations constrain nearest neighbour density estimators
to small data sets only. Recent progress using indexing schemes has improved the time complexity to near
linear only.

We propose a new approach, called LiNearN for Linear time Nearest Neighbour algorithm, that yields
the first nearest neighbour density estimator having $O(n)$ time complexity and constant space complexity,
as far as we know. This is achieved without using any indexing scheme because LiNearN uses a subsampling
approach in which the subsample sizes are significantly smaller than the data size. Like existing density
estimators, our asymptotic analysis reveals that the new density estimator has a parameter to trade off
between bias and variance. We show that algorithms based on the new nearest neighbour density estimator
can easily scale up to data sets with millions of instances in anomaly detection and clustering tasks.

Keywords: k-nearest neighbour, density-based, anomaly detection, clustering


1. Introduction and Motivation

Existing methods have utilised nearest neighbour density estimators as the basis to solve all facets of
pattern recognition problems from classification and regression to clustering and anomaly detection
[5, 7, 11, 13, 15].

While existing nearest neighbour density estimators have been effective, the time complexity is still
basically $O(n^2)$ because of the need to find the nearest neighbour for every instance in a given data set.
This makes existing methods utilising nearest neighbour density estimators impractical for problems with
large data sets. Recent research has substantially improved the k-nearest neighbour search by introducing
various indexing schemes to speed up the search (e.g., Cover Trees [9], M-Trees [12] and R*-Tree [8]) to near
linear time complexity.

The premise of the current research is that finding the nearest neighbour for every instance in the given
data set is inevitable, which leads to $O(n^2)$ time complexity. Since the aim is to do density estimation
rather than nearest neighbour search itself, we reject this premise and find a way to reduce the number of
distance calculations required to achieve this aim.

We propose a new approach to nearest neighbour density estimation. Instead of focusing on speeding
up the nearest neighbour search, the new approach first generates many local regions from subsamples and
then produces the final result using an ensemble method. The speedup is achieved because the size of the
subsamples required is significantly smaller than the given data set. This not only eliminates the need for
an indexing scheme but also enables the new density estimator to run orders of magnitude faster than
existing nearest neighbour density estimators.
We make three contributions in this paper:

1. Introduce a new nearest neighbour density estimator that defines local neighbourhoods using nearest
neighbours in each of the many subsamples by building a region centered at each instance. This differs
from the existing nearest neighbour density estimators where the local neighbourhoods are defined based on
either k nearest neighbours or a fixed radius.

2. Provide an asymptotic analysis, which reveals that the new density estimator has a parameter which
trades off between bias and variance, as in existing density estimators such as k-nearest neighbour density
estimators.

3. Demonstrate the advantages of the new approach over the existing nearest neighbour density estimators
in two tasks: anomaly detection and clustering. The new approach reduces the time complexity
from $O(n^2)$ to $O(n)$ and the space complexity from $O(n)$ to constant. We call the new approach
LiNearN for Linear time Nearest Neighbour algorithm.

Since nearest neighbour density estimators are the core mechanism in many pattern recognition algorithms,
we will begin the next section with a description of existing nearest neighbour density estimators.
Section 3 introduces the new nearest neighbour density estimator and provides the asymptotic analysis.
Section 4 describes how nearest neighbour density estimators, both existing and new, are applied to
anomaly detection and clustering tasks. Section 5 reports the empirical evaluation results. Discussion and
the conclusions are provided in the last two sections.


2.  Existing  Nearest  Neighbour  Density  Estimators 


We  describe  three  existing  nearest  neighbour  density  estimators  below. 

1. A k-nearest neighbour (k-NN) density estimator can be expressed as follows [11, 32]:

$$f_{kNN}(x) = \frac{|N(x,k)|}{n \sum_{x' \in N(x,k)} \|x - x'\|_p}$$

where $N(x,k)$ is the set of k nearest neighbours of x, $|S|$ denotes the cardinality of a set S, and
$\|x - x'\|_p$ denotes the distance between x and x' measured by the $L_p$-norm. The search for nearest
neighbours is conducted over D of size n, where D is the given data set.

2. A kth nearest neighbour density estimator is defined as follows [25]:

$$f_{k^{th}NN}(x) = \frac{|N(x,k)|}{n\,\alpha(d,p)\,\|x - x_k\|_p^d}$$

where $x_k$ is the kth nearest neighbour of x and $\alpha(d,p)$ is the volume of a unit ball in $(\Re^d, L_p)$.

3. An $\epsilon$-neighbourhood density estimator is defined as follows:

$$f_{\epsilon}(x) = \frac{|N_{\epsilon}(x)|}{n\epsilon}$$

where $N_{\epsilon}(x) = \{q \in D \mid \|x - q\|_p \le \epsilon\}$. Since the denominator $n\epsilon$ is the same for all x,
it is usually omitted in the implementation (e.g., in DBSCAN [15]).

Each of the above determines a local neighbourhood based on a global parameter, i.e., k or $\epsilon$; and the
density is calculated based on one variable: the distance of the k nearest neighbours in $f_{kNN}$ or $f_{k^{th}NN}$,
since the numerator is a constant k; and $|N_{\epsilon}(\cdot)|$ in $f_{\epsilon}$, since the denominator is a constant.

In addition, the nearest neighbour search is conducted over the entire data set D, which is the main
computational expense of the whole process and therefore leads to a time complexity of $O(n^2)$ for n queries.
Research has focused on reducing this cost by devising different indexing schemes.
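
To make the cost of the existing estimators concrete, the following is a minimal brute-force sketch of $f_{kNN}$ and $f_{\epsilon}$ as defined above. The function names, the toy data and the use of NumPy are assumptions made for illustration; the $n \times n$ distance matrix is exactly the quadratic cost that the text refers to.

```python
import numpy as np

def pairwise_dist(X, p=2.0):
    """Brute-force L_p distances between all rows of X (an n x n matrix)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(axis=2)
    return (diff ** p).sum(axis=2) ** (1.0 / p)

def knn_density(X, k, p=2.0):
    """f_kNN: k divided by n times the summed distance to the k nearest neighbours."""
    n = len(X)
    dist = pairwise_dist(X, p)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]   # k smallest distances per row
    return k / (n * nearest.sum(axis=1))

def eps_density(X, eps, p=2.0):
    """f_eps: number of instances q in D with ||x - q|| <= eps, divided by n * eps.

    As in the set definition above, the query point itself is counted.
    """
    n = len(X)
    dist = pairwise_dist(X, p)
    return (dist <= eps).sum(axis=1) / (n * eps)

# Toy data: three close points and one isolated point (hypothetical values).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
dens_knn = knn_density(X, k=2)
dens_eps = eps_density(X, eps=0.5)
```

Both functions build the full distance matrix, so time and memory grow quadratically with n; this is the behaviour that indexing schemes, and later LiNearN, aim to avoid.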
Table 1: Key differences between existing nearest neighbour algorithms and LiNearN in terms of methodology
and time complexity.

                    Existing NN                                 LiNearN
------------------  ------------------------------------------  ------------------------------------
Methodology         Single model                                Multiple models

                    Density for each $x \in D$ is derived       Density for each $x \in D$ is derived
                    from a single local region via NN           from many local regions (LR).
                    searches (e.g., $f_{kNN}$, $f_{k^{th}NN}$
                    or $f_{\epsilon}$).

                    Indexing† is required to speed up NN        NN search without indexing.
                    search. Often relies on the triangle        1. NN search in a subset of D
                    inequality to prune the search space.          (t times) to define LR.
                                                                2. NN search to make the final
                                                                   estimation for each $x \in D$.

Time complexity
- LR building       Not applicable                              1. $\psi(\psi + \Psi)t$
- Index building    nil or $n \log n$ ‡                         Not applicable
- n queries         $n^2$ or $n \log n$                         2. $\psi n t$

† An alternative to indexing is clustering-based search [27], which often needs a higher time cost than indexing.
‡ Without indexing, n queries in existing nearest neighbour algorithms have $O(n^2)$ time complexity; with
indexing methods such as Cover Trees [9] and M-Trees [12], n queries have $O(n \log n)$ time complexity.
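
As a rough worked example of the time complexity rows in Table 1, the sketch below counts distance computations for LiNearN and for a brute-force search. The parameter values ($\psi = 16$, $\Psi = 256$, t = 100, n = 1,000,000) are hypothetical and chosen only to show the orders of magnitude involved.

```python
def linearn_cost(n, psi, Psi, t):
    """Distance computations suggested by Table 1: LR building plus n queries."""
    build = psi * (psi + Psi) * t   # NN search within each subsample, then mass assignment
    query = psi * n * t             # each query scans psi region centres in each of t sets
    return build + query

def brute_force_cost(n):
    """n queries, each scanning all n instances without an index."""
    return n * n

n = 1_000_000                        # hypothetical data size
cost_new = linearn_cost(n, psi=16, Psi=256, t=100)
cost_old = brute_force_cost(n)
ratio = cost_old / cost_new          # how many times fewer distance computations
```

With these assumed values the query term dominates the LiNearN cost, and the count grows linearly in n because $\psi$, $\Psi$ and t stay fixed.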


We suggest a new approach to compute density based on nearest neighbour with the following
distinguishing features:

• Both the number of instances in the local neighbourhood and its volume are adaptive to the data
distribution in the local region; neither is fixed by a global parameter, unlike $f_{kNN}(\cdot)$, $f_{k^{th}NN}(\cdot)$ and
$f_{\epsilon}(\cdot)$.

•  The  nearest  neighbour  search  is  conducted  over  a  data  subset  which  is  significantly  smaller  than  the 
given  data  set. 

We  describe  the  new  density  estimator  in  the  next  section. 


3.  New  nearest  neighbour  density  estimator 

We propose a new nearest neighbour density estimator, called LiNearN for Linear time Nearest
Neighbour algorithm. It estimates the density for a point x by averaging densities of multiple local
regions covering x. Whilst the local regions could be implemented in different ways, we focus on deriving the
local regions using nearest neighbours. Because these local regions can be defined by using a significantly
smaller data set than the given data set, the computational expense for nearest neighbour search is reduced
to such an extent that an indexing scheme becomes unnecessary.

We  describe  LiNearN  in  the  following  five  subsections.  After  describing  the  key  differences  between 
the  new  and  existing  density  estimators  in  the  first  subsection,  LiNearN  is  formally  defined  in  the  second 
subsection  with  an  illustration  in  the  third  subsection.  The  asymptotic  error  analysis  is  given  in  the  fourth 
subsection  followed  by  its  implementation  in  the  fifth  subsection. 

3.1.  Key  Differences 

The key differences between LiNearN and existing density estimators based on nearest neighbour are
shown in Table 1. Since all parameters, except n, are constant and both $\psi \ll n$ and $\Psi \ll n$ (see definitions in
Section 3.2), the time complexity of LiNearN is $O(n)$, which is significantly smaller than $O(n^2)$ or $O(n \log n)$.

Unlike the $\epsilon$-neighbourhood density estimator, which employs a global $\epsilon$ (where every local region has the
same size), LiNearN adapts the size of each local region to the local data distribution: sparse regions
have large local regions, whereas dense regions have small local regions. While the k-nearest neighbour density
estimator can adapt to local data distributions in simple cases where there are only two densities by choosing
an appropriate k, this is not possible in more complex cases where many different densities exist in the data.

Instead  of  relying  on  a  single  estimation  using  the  entire  data  set,  LiNearN  computes  the  average  of 
multiple  estimations.  Each  of  the  estimations  can  be  done  with  a  significantly  smaller  data  subset  —  this  is 
where  the  significant  speedup  over  existing  nearest  neighbour  density  estimators  is  achieved. 

The typical nearest neighbour algorithm (e.g., LOF [11], ORCA [7] or DBSCAN [15]) must store all
instances of a given data set for the distance calculation. In contrast, LiNearN must store $t\psi$ instances
only, which is constant.

Though LiNearN requires nearest neighbour search, only a linear search is required because it involves
a small sample size only, where $\psi \ll n$ and $\Psi \ll n$. Existing algorithms which employ nearest neighbour
density estimators must rely on indexing to reduce the time complexity from $O(n^2)$ to $O(n \log n)$ [9, 12].

3.2.  Definitions 

Let D be a set of i.i.d. samples in a $(\Re^d, L_p)$ space, where the length of $x = (x_1, x_2, \ldots, x_d)^T \in \Re^d$ is
measured by an $L_p$-norm:

$$\|x\|_p = \left(|x_1|^p + |x_2|^p + \cdots + |x_d|^p\right)^{1/p}$$

where p is in $(0, \infty]$, and $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_d|\}$. Furthermore, let $\mathcal{D} \subset D$ be selected using i.i.d.
sampling without replacement.

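Definition 1. $\mathbb{H}_c \subset (\Re^d, L_p)$ is a local region having its center at $c \in \mathcal{D}$ and its radius $r_c = \frac{1}{2}\|c - x\|_p$,
where x is the nearest neighbour of c in $\mathcal{D}$. □


The $L_p$-norm determines the shape of the local region $\mathbb{H}_c$. For example, $\mathbb{H}_c$ is a hypersphere if p = 2, and a
hypercube if $p = \infty$. Note that a region $\mathbb{H}_c$ does not overlap with any other region built from the same $\mathcal{D}$,
because each radius is at most half the distance between any two centers.

Definition 2. $H$ is the set of $|\mathcal{D}|$ local regions $\mathbb{H}_c$ constructed from $\mathcal{D}$: $H = \{\mathbb{H}_c \mid c \in \mathcal{D}\}$. □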
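
Definitions 1 and 2 can be sketched in a few lines. The snippet below assumes the $L_\infty$-norm (hypercube regions), uses a hypothetical helper name `build_regions`, and checks numerically that no two regions overlap in their interiors.

```python
import numpy as np

def build_regions(sample):
    """Return (centres, radii); each radius is half the L_inf distance to the nearest neighbour."""
    sample = np.asarray(sample, dtype=float)
    dist = np.abs(sample[:, None, :] - sample[None, :, :]).max(axis=2)
    np.fill_diagonal(dist, np.inf)           # ignore the distance of a point to itself
    radii = 0.5 * dist.min(axis=1)
    return sample, radii

rng = np.random.default_rng(0)
centres, radii = build_regions(rng.random((8, 2)))   # a hypothetical subsample of size 8

# Two hypercubes share interior points only if the L_inf distance between
# their centres is smaller than the sum of their radii.
gap = np.abs(centres[:, None, :] - centres[None, :, :]).max(axis=2)
np.fill_diagonal(gap, np.inf)
overlaps = gap < (radii[:, None] + radii[None, :]) - 1e-12
```

The non-overlap property follows because each radius is at most half of every centre-to-centre distance, so the sum of two radii never exceeds the distance between the two centres.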


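Definition 3. Given $H$ and $\mathcal{S} \subset D$ selected using i.i.d. sampling without replacement, if $\exists\, \mathbb{H}_c \in H$ such
that $x \in \mathbb{H}_c$, a distance-based density of x is defined as

$$\rho(x|H,\mathcal{S}) = \frac{|\mathcal{S}(\mathbb{H}_{cnn(x)})|}{|\mathcal{S}|\; r_{cnn(x)}} \tag{1}$$

where $cnn(x) = \arg\min_{c \in \mathcal{D} \text{ s.t. } x \in \mathbb{H}_c \in H} \|c - x\|_p$ and $\mathcal{S}(\mathbb{H}_{cnn(x)}) = \{o \in \mathbb{H}_{cnn(x)} \mid o \in \mathcal{S}\}$. □


$\rho(x|H,\mathcal{S})$ is given by $H$ and $\mathcal{S}$ if an $\mathbb{H}_c$ containing x exists in $H$, while $H$ is defined by $\mathcal{D}$. $|\mathcal{S}|$ is generally
set to be larger than $|\mathcal{D}|$. Note that $\rho(x|H,\mathcal{S}) = 0$ if $x \notin \mathbb{H}_c$ for every $\mathbb{H}_c \in H$. This distance-based density is
defined as a ratio of the number of instances to a distance, following the similar density definition called
‘inverse distance’ for the k-NN density estimator [32].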
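
The distance-based density of Definition 3 can be illustrated with a small sketch. It assumes the $L_\infty$-norm and represents a region set as two arrays (`centres`, `radii`); these names, the helper `distance_density` and the toy values are hypothetical.

```python
import numpy as np

def distance_density(x, centres, radii, mass_sample):
    """rho(x | H, S): share of S inside the region covering x, divided by that region's radius."""
    x = np.asarray(x, dtype=float)
    d_to_centres = np.abs(centres - x).max(axis=1)        # L_inf distance to each centre
    covering = np.flatnonzero(d_to_centres <= radii)
    if covering.size == 0:
        return 0.0                                         # x lies in no local region
    c = covering[np.argmin(d_to_centres[covering])]        # cnn(x)
    inside = np.abs(mass_sample - centres[c]).max(axis=1) <= radii[c]
    return inside.sum() / (len(mass_sample) * radii[c])

# Two regions whose radii are half the centre-to-centre distance (toy values).
centres = np.array([[0.0, 0.0], [4.0, 0.0]])
radii = np.array([2.0, 2.0])
S = np.array([[0.5, 0.5], [-1.0, 1.0], [3.5, 0.0], [9.0, 9.0]])

rho_a = distance_density([0.2, 0.1], centres, radii, S)    # covered by the first region
rho_b = distance_density([9.0, 9.0], centres, radii, S)    # covered by no region
```

Here two of the four instances in S fall in the region around the first centre, so the estimate for the first query is 2 / (4 × 2); the second query is outside every region and receives zero.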


Definition 4. Average distance-based density of a point x is estimated from t local regions as follows:

$$\bar{f}(x) = \frac{1}{t}\sum_{i=1}^{t} \rho(x|H^i,\mathcal{S}^i) \tag{2}$$

where $H^i$ is generated from $\mathcal{D}^i \subset D$. This summation excludes $\rho(x|H^i,\mathcal{S}^i)$ such that $\nexists\, \mathbb{H}_c \in H^i$ with $x \in \mathbb{H}_c$.
$\mathcal{S}(\mathbb{H}_{cnn(x)})$ of every $\mathbb{H}_c \in H^i$ is estimated from $\mathcal{S}^i \subset D$. $\forall i$, $|\mathcal{D}^i| = \psi$ and $|\mathcal{S}^i| = \Psi$. □

When producing an estimate from multiple local regions, it is important to ensure that only one estimation
from each $H^i$ contributes to the final result. The use of $\mathbb{H}_{cnn(x)} \in H^i$ ensures that $\bar{f}(x)$ at any point x
uses only one of the local regions in $H^i$.

We further introduce a volume-based density, whose definition is consistent with the population probability
density $\phi(x)$ of D.


Definition 5. Given $H$ and $\mathcal{S} \subset D$ selected using i.i.d. sampling without replacement, if $\exists\, \mathbb{H}_c \in H$ such
that $x \in \mathbb{H}_c$, a volume-based density of x is defined as

$$\varrho(x|H,\mathcal{S}) = \frac{|\mathcal{S}(\mathbb{H}_{cnn(x)})|}{|\mathcal{S}|\;|\mathbb{H}_{cnn(x)}|} \tag{3}$$

where $cnn(x)$ and $\mathcal{S}(\mathbb{H}_{cnn(x)})$ are defined in Definition 3. $|\mathbb{H}_{cnn(x)}|$ is the volume of $\mathbb{H}_{cnn(x)}$ given by
$|\mathbb{H}_{cnn(x)}| = \alpha(d,p)\, r_{cnn(x)}^d$, where $\alpha(d,p)$ is the volume of a unit ball in $(\Re^d, L_p)$. □


For example, $\alpha(d,2) = \frac{\pi^{d/2}}{\Gamma(d/2+1)}$, where $\Gamma(\cdot)$ is the gamma function, and $\alpha(d,\infty) = 2^d$. $\rho(x|H,\mathcal{S})$ is related
to $\varrho(x|H,\mathcal{S})$ as follows:

$$\rho(x|H,\mathcal{S}) = \frac{|\mathbb{H}_{cnn(x)}|}{r_{cnn(x)}}\, \varrho(x|H,\mathcal{S}). \tag{4}$$

We define average volume-based density by $\varrho(x|H,\mathcal{S})$ as follows.


Definition 6. Similar to Definition 4, average volume-based density of a point x is estimated from t local
regions as follows:

$$\zeta(x) = \frac{1}{t}\sum_{i=1}^{t} \varrho(x|H^i,\mathcal{S}^i). \tag{5}$$ □

For the rest of this paper, we will refer to the shape of each region $\mathbb{H}_c$ as a hypercube by using $p = \infty$,
with the understanding that the shape of each region can be easily changed by changing the $L_p$-norm.
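
The two special cases of $\alpha(d,p)$ quoted above, and the relation between the distance-based and volume-based densities in Equation (4), can be checked numerically. The sketch below uses the standard closed form for the volume of a unit $L_p$ ball, $(2\Gamma(1/p+1))^d / \Gamma(d/p+1)$, which is not stated in the text and is brought in here as an assumption; the function names are hypothetical.

```python
import math

def unit_ball_volume(d, p):
    """Volume of the unit ball in (R^d, L_p); p may be math.inf."""
    if math.isinf(p):
        return 2.0 ** d
    return (2.0 * math.gamma(1.0 / p + 1.0)) ** d / math.gamma(d / p + 1.0)

def volume_density(mass, sample_size, radius, d, p=math.inf):
    """Equation (3): mass / (|S| * volume of the region)."""
    return mass / (sample_size * unit_ball_volume(d, p) * radius ** d)

def distance_density(mass, sample_size, radius):
    """Equation (1): mass / (|S| * radius)."""
    return mass / (sample_size * radius)

# Toy region: 3 of 16 sampled instances fall in a 2-d hypercube of radius 0.5.
mass, size, r, d = 3, 16, 0.5, 2
region_volume = unit_ball_volume(d, math.inf) * r ** d
lhs = distance_density(mass, size, r)
rhs = (region_volume / r) * volume_density(mass, size, r, d)   # Equation (4)
```

For p = 2 the general formula reduces to $\pi^{d/2}/\Gamma(d/2+1)$, and for $p = \infty$ the branch returns $2^d$ directly, matching the two examples in the text.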

3.3.  Illustration 

An example of hypercubes, as a result of using the $L_\infty$-norm in a 2-dimensional data set, generated from
$\mathcal{D} \subset D$ is shown in Figures 1(a) and 1(b). Figure 1(c) shows an example of estimating density for an instance
x using multiple hypercubes from $\{H^i \mid i = 1, \ldots, t\}$ where t = 4.

The number of hypercubes in $H$, $\psi$, determines the smoothness of the estimation. It has a similar effect
to the k parameter in the k-nearest neighbour density estimator. Three example density distributions, as shown
in Figures 2(a)-2(c), are generated by LiNearN using three different values of $\psi$. Low $\psi$ produces a smooth
distribution, whereas high $\psi$ produces a more spiky distribution. The corresponding density distributions
generated by $f_{kNN}$ and $f_{\epsilon}$ are shown in Figures 2(d)-2(f) and Figures 2(g)-2(i), respectively.

Notice that there is an anomaly cluster in the bottom left corner of Figure 1(a) which consists of five
instances densely packed into a small region. The density distributions produced by LiNearN in Figure 2
show that the anomaly cluster has low density when a small $\psi$ is used and the ‘true’ density only starts to
emerge when a high $\psi$ is used. This feature is especially useful for the purpose of anomaly detection and
has the following impacts:

• Both scattered anomalies and clustered anomalies can be detected using a small value of $\psi$.

• The time cost increases linearly with $\psi$. A small $\psi$ value means that LiNearN can detect anomalies
with a small time cost.
(a) Given data set, D. Modified from [21].

(b) Example hypercubes for $|\mathcal{D}| = 4$ and $|\mathcal{S}| = 16$.

(c) Example density estimation from multiple hypercubes using $\bar{f}(x)$ in Equation (2).


Figure 1: An example of constructing hypercubes using $\mathcal{D} \subset D$ (see Algorithm 2 in Section 3.5) and assigning
sample mass in each hypercube to estimate density using $\mathcal{S} \subset D$. These two processes are shown in (b).


Figure 2: Example density distributions for the data set shown in Figure 1(a). Figures (a)-(c) show the
distributions generated by LiNearN, where $\bar{f}(\cdot)$ uses three different values of $\psi$ with $\Psi = 256$ and t = 1000.
The corresponding density distributions for $f_{kNN}(\cdot)$ and $f_{\epsilon}(\cdot)$ are shown in (d)-(f) and (g)-(i), respectively.
In contrast, when k-nearest neighbour is used to detect clustered anomalies (e.g., in LOF [11] and ORCA
[7]), k needs to be set to a value larger than the number of instances
in every anomaly cluster [24]. Otherwise, the k-nearest neighbour algorithms would fail to detect anomaly
clusters having a number of instances larger than k. This has two implications. First, the time cost increases
as k increases. Second, a more detrimental effect on the runtime of k-nearest neighbour anomaly detectors is
that k needs to be searched over a large range of values. This adds a significant time cost to their already
expensive $O(n^2)$ time complexity.

We will demonstrate these advantages of LiNearN over existing nearest neighbour algorithms in
Section 5.1.

3.4. Asymptotic Error Analysis

$\zeta(x)$ can be thought of as a random variable because of its dependence on D and its random subsamples
$\mathcal{D}^i$ and $\mathcal{S}^i$ ($i = 1, \ldots, t$). Accordingly, we analyse the Mean Squared Error (MSE) of $\zeta(x)$ from the population
probability density $\phi(x)$. It is defined as

$$MSE(\zeta(x)) = E[\{\zeta(x) - \phi(x)\}^2]$$

where the expectation $E[\cdot]$ is taken over the distribution of $\zeta(x)$. This is rewritten by introducing the
expectation of $\zeta(x)$, $E[\zeta(x)]$, as follows [30]:

$$MSE(\zeta(x)) = \{E[\zeta(x)] - \phi(x)\}^2 + E[\{\zeta(x) - E[\zeta(x)]\}^2]. \tag{6}$$

The first term on the right-hand side is called ‘square bias’ and the second term ‘variance’.
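
The decomposition in Equation (6) can be verified numerically on any set of estimates. The sketch below uses synthetic estimator outputs with an assumed bias of 0.3 and noise of standard deviation 0.5 around a hypothetical true density of 2.0; these numbers are illustrative only and are not LiNearN results.

```python
import numpy as np

rng = np.random.default_rng(1)
true_density = 2.0                                     # stands in for phi(x) at a fixed x
# Hypothetical estimator outputs: systematically too high, plus random noise.
estimates = true_density + 0.3 + 0.5 * rng.standard_normal(10_000)

mse = np.mean((estimates - true_density) ** 2)         # E[{zeta - phi}^2]
sq_bias = (estimates.mean() - true_density) ** 2       # {E[zeta] - phi}^2
variance = np.mean((estimates - estimates.mean()) ** 2)  # E[{zeta - E[zeta]}^2]
```

The identity holds exactly for the empirical distribution because the cross term, the mean deviation from the mean multiplied by the bias, is zero.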

Preliminary to evaluating the magnitudes of these terms, we present the moment expectations of the nearest
distance distribution at a point x under a generic population distribution $\phi(x)$ in a $(\Re^d, L_p)$ space.
Evans et al. [17] already assessed a closely related problem. Although they did not explicitly state the
applicability of their analysis to generic $L_p$ distance measures, the applicability was indicated in a subsequent paper
[16]. These papers focused on the moment expectations marginalized over a compact convex space having
a unit volume. Here, we further assess the moment expectations at a point x based on their analyses.

Theorem 1. Let $C$ be a compact convex body in $(\Re^d, L_p)$. Assume that all instances in D, and hence in
$\mathcal{D} \subset D$, are i.i.d. sampled from $C$ according to their population probability density $\phi(x)$ satisfying three
conditions: $\phi(x)$ is continuous, positive, and has bounded partial derivatives for all $x \in C$. Then for all
$0 < \epsilon < 1/d$ and integer $m > 0$, the expectation of the m-th moment of $r_c$ defined in Definition 1 is
represented as follows for some $0 < v(d,p,x) < 1$:

$$E[r_c^m] = \frac{c(d,m,p)}{(\psi + 1)^{m/d}} + O\!\left(\frac{1}{\psi^{m/d+\epsilon}}\right) \tag{7}$$

as $\psi \to \infty$, where $\psi = |\mathcal{D}|$, $c(d,m,p) = [\ldots]$ is a constant not depending on $\psi$,
and $\alpha(d,p)$ is the volume of a unit ball in $(\Re^d, L_p)$.


The  proof  is  provided  in  Appendix  A. 


The compactness and convexity of $C$ and the three assumptions on $\phi(x)$ over $C$ virtually do not limit the
applicability of this result, since $C$ can be chosen to sufficiently cover the areas containing all data points in
D, a continuous and smooth $\phi(x)$ can be assumed, and small positive values of $\phi(x)$ can be assumed even
in the areas of $C$ where no data points in D are located.

Now, we obtain the following theorem evaluating asymptotic behaviours of the square bias and the
variance in the MSE of $\zeta(x)$ in terms of the number of subsamples. In addition to the assumptions of Theorem 1,
we further introduce an assumption of bounded second order partial derivatives of $\phi(x)$. These assumptions
do not limit the applicability of the result to practical problems.
Theorem 2. Under the assumptions of Theorem 1 and an assumption that $\phi(x)$ has bounded second order
partial derivatives at each point $x \in C$, the magnitudes of the square bias and the variance of $\zeta(x)$
asymptotically behave as follows.


$$\{E[\zeta(x)] - \phi(x)\}^2 = O(\psi^{-2/d}), \text{ and} \tag{8}$$

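$$E[\{\zeta(x) - E[\zeta(x)]\}^2] = \begin{cases} O(t^{-1}\Psi^{-1}\psi^{-1}) & \text{if } d = 1, \\ [\ldots] & \text{for all } 0 < \epsilon < 1/d \text{ if } d \ge 2. \end{cases} \tag{9}$$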

The  proof  is  provided  in  Appendix  B. 

This result indicates that the square bias of $\zeta(x)$ diminishes and the variance increases as the size of $\mathcal{D}$
increases, i.e., as the size of the local region $\mathbb{H}_c$ decreases, except for the case of one-dimensional data. This
property of the average volume-based density estimator with respect to the size of $\mathcal{D}$ is similar to that of the
conventional k-nearest neighbour estimator, which also shows a decrease of the bias and an increase of the variance
as the k value is reduced. On the other hand, a significant advantage of our approach is the dependency of the
variance on the inverse of t and the inverse of the size of $\mathcal{S}$. By increasing these parameters, we can reduce
the variance while maintaining the squared bias. In theory, the average volume-based density estimator can thus
outperform the k-nearest neighbour estimator while maintaining the lower computation cost.


3.5.  Implementation 

The  procedural  outline  of  LiNearN  is  given  below: 

① Select a subsample $\mathcal{D}$ of size $\psi$ from D, where $\psi \ll |D|$.

② $\forall c \in \mathcal{D}$, identify its nearest neighbour; and construct a hypercube region centered at c with ‘radius’
$r_c$. A total of $\psi$ non-overlapping hypercube regions is produced from $\mathcal{D}$. (Definitions 1 and 2)

③ Repeat steps ① and ② t times to produce $\{H^i \mid i = 1, \ldots, t\}$.

④ A subset $\mathcal{S} \subset D$ is used to estimate the number of instances covered by each hypercube region.
(Definition 3)

⑤ Estimate the density for each x by averaging the densities of t hypercube regions that cover x.
(Definition 4)

Steps ① to ③ are implemented in Algorithm 1, and the actual construction of hypercube regions is
implemented in Algorithm 2. Steps ④ and ⑤ are implemented in Algorithms 3 and 4, respectively.


Algorithm 1: LiNearN(D, t, $\psi$, $\Psi$)

input : D - input data, t - number of subsamples $\mathcal{D}$, $\psi$ - the size of subsample used for constructing
        a set of hypercube regions $H^i$, $\Psi$ - the size of subsample used to estimate density in $H^i$
output: $\{H^i \mid i = 1, \ldots, t\}$ where each $H^i$ contains a total of $\psi$ non-overlapping hypercube regions
        with estimated densities.

for i = 1 to t do
    $\mathcal{D}$ ← sample(D, $\psi$) {strictly without replacement};
    $H^i$ ← BuildHyperCubes($\mathcal{D}$);
end
AssignSampleMass($\{H^i \mid i = 1, \ldots, t\}$, D, $\Psi$);
return $\{H^i \mid i = 1, \ldots, t\}$ with estimated densities;


Algorithm 2: BuildHyperCubes($\mathcal{D}$)

input : $\mathcal{D}$ - subsample used to build hypercube regions.
output: $H$ - a set of $|\mathcal{D}|$ non-overlapping hypercube regions.

$H$ ← ∅;
for m = 1 to $|\mathcal{D}|$ do
    Initialise $\mathbb{H}$;
    $\mathbb{H}$.center ← $x_m$;
    $x_{NN}$ ← $\arg\min_{o \in \mathcal{D} \setminus \{\mathbb{H}.center\}} \|\mathbb{H}.center - o\|_\infty$ {find the nearest neighbour};
    $\mathbb{H}$.radius ← $\frac{1}{2}\|\mathbb{H}.center - x_{NN}\|_\infty$;
    $\mathbb{H}$.mass ← 0;
    $H$ ← $H \cup \{\mathbb{H}\}$;
end
return $H$;


Algorithm 3: AssignSampleMass(C, D, $\Psi$)

input : C - $\{H^i \mid i = 1, \ldots, t\}$, D - input data, $\Psi = |\mathcal{S}|$ - the size of subsample used for density
        estimation.
output: C - $\{H^i \mid i = 1, \ldots, t\}$ with their mass values estimated from $\mathcal{S}^i$, respectively.

for i = 1 to t do
    $\mathcal{S}^i = \{x_1, \ldots, x_\Psi\}$ ← sample(D, $\Psi$) {strictly without replacement};
    for j = 1 to $\Psi$ do
        $\mathbb{H}$ ← search($H^i$, $x_j$) {return $\mathbb{H} \in H^i$ that covers $x_j$, i.e., $\|\mathbb{H}.center - x_j\|_\infty \le \mathbb{H}.radius$};
        if $\mathbb{H}$ is not NULL then
            $\mathbb{H}$.mass ← $\mathbb{H}$.mass + 1;
        end
    end
end
return C;


Algorithm 4: Density(x, C)

input : x - input point, C - $\{H^i \mid i = 1, \ldots, t\}$.
output: $\rho/t$ - average density estimated for x.

$\rho$ ← 0;
for i = 1 to t do
    $\mathbb{H}$ ← search($H^i$, x) {return $\mathbb{H}$ that covers x};
    if $\mathbb{H}$ is not NULL then
        $\rho$ ← $\rho$ + ($\mathbb{H}$.mass / $\mathbb{H}$.radius);
    end
end
return $\rho/t$;
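
Algorithms 1 to 4 can be sketched together in a short NumPy program. This is an illustrative reading of the pseudocode rather than the authors' implementation: the function names, the dictionary layout and the toy data are assumptions, the $L_\infty$-norm is used throughout, and, as in Algorithm 4, the mass is divided by the radius only. The sketch also assumes the sampled instances are distinct, since duplicates would give a zero radius.

```python
import numpy as np

def build_hypercubes(sample):
    """Algorithm 2 sketch: one L_inf region per sampled point, radius = half the NN distance."""
    dist = np.abs(sample[:, None, :] - sample[None, :, :]).max(axis=2)
    np.fill_diagonal(dist, np.inf)
    return {"center": sample, "radius": 0.5 * dist.min(axis=1),
            "mass": np.zeros(len(sample))}

def search(model, x):
    """Index of the region covering x, or None; a linear scan over the psi centres."""
    d = np.abs(model["center"] - x).max(axis=1)
    covering = np.flatnonzero(d <= model["radius"])
    return int(covering[np.argmin(d[covering])]) if covering.size else None

def linearn_fit(D, t, psi, Psi, rng):
    """Algorithms 1 and 3 sketch: build t region sets, then count sample mass in each."""
    models = []
    for _ in range(t):
        model = build_hypercubes(D[rng.choice(len(D), psi, replace=False)])
        for x in D[rng.choice(len(D), Psi, replace=False)]:
            c = search(model, x)
            if c is not None:
                model["mass"][c] += 1
        models.append(model)
    return models

def density(x, models):
    """Algorithm 4 sketch: average of mass / radius over the t region sets."""
    total = 0.0
    for model in models:
        c = search(model, x)
        if c is not None:
            total += model["mass"][c] / model["radius"][c]
    return total / len(models)

# Toy data: one dense Gaussian cluster plus a single far-away instance.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.5, size=(2000, 2))
D = np.vstack([cluster, [[6.0, 6.0]]])
models = linearn_fit(D, t=100, psi=16, Psi=256, rng=rng)
dense_score = density(np.array([0.0, 0.0]), models)
outlier_score = density(np.array([6.0, 6.0]), models)
```

Only the t region sets are kept after fitting, so the memory footprint depends on t and $\psi$ but not on n, and each query costs $t\psi$ distance computations.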
Table 2: Principal steps in LOF and LiNearN for anomaly detection.

Step  LOF                                                LiNearN
----  -------------------------------------------------  ------------------------------------------
1     Compute density distribution: $f_{kNN}(x)$         Compute density distribution: $\bar{f}(x)$

2     Compute LOF(x) using                               [Step 2 is not required for LiNearN.]
      $\left(\sum_{x' \in N(x,k)} \frac{f_{kNN}(x')}{|N(x,k)|}\right) / f_{kNN}(x)$

3     Rank all instances based on their                  Rank all instances based on their
      LOF values in descending order                     $\bar{f}(x)$ values in ascending order


4.  Density  estimators  in  anomaly  detection  and  clustering  tasks 

4.1. Anomaly detection

The  prevalent  anomaly  detection  approaches  (e.g.,  LOF  and  ORCA)  rely  on  distance-induced  density 
based  on  the  following  definition: 

Density-based  Anomalies  are  instances  in  regions  of  low  density  or  low  relative  density  in  the 
local  neighbourhood. 

Variants of the definition are used in the literature. Note that though they do not use the term ‘density’ in
their definitions, they are essentially estimating density based on distance calculations, where density is
a ratio of the number of instances in a spherical region to the radius of the region. The first definition
below has a fixed radius, whereas the second and third definitions have a fixed number of instances k.

1. (MinPts, $\epsilon$)-Distance Anomalies are instances which have fewer than MinPts instances within a
distance $\epsilon$ [22].

2. kth NN Distance Anomalies are the top-ranked instances whose distance to the kth nearest
neighbour is greatest [26].

3. Average kNN Distance Anomalies are the top-ranked instances whose average distance to the k
nearest neighbours is greatest [4].

4. Anomalies based on Local Outlier Factor (LOF) have high LOF values, where the LOF of x is a
ratio of the average density in the local neighbourhood of x to the density of x [11]. The density can be
defined using $f_{kNN}$ or other density estimators.
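
The first three definitions can be sketched with a brute-force distance matrix. The function names and the toy data below are hypothetical, and Euclidean distance is assumed; the cited methods differ in how they prune or index the search.

```python
import numpy as np

def knn_distances(X, k):
    """Sorted Euclidean distances from each row of X to its k nearest neighbours."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # exclude the instance itself
    return np.sort(dist, axis=1)[:, :k]

def kth_nn_score(X, k):
    """Definition 2: distance to the kth nearest neighbour (larger = more anomalous)."""
    return knn_distances(X, k)[:, -1]

def avg_knn_score(X, k):
    """Definition 3: average distance to the k nearest neighbours."""
    return knn_distances(X, k).mean(axis=1)

def minpts_eps_anomalies(X, min_pts, eps):
    """Definition 1: fewer than min_pts other instances within distance eps."""
    return (knn_distances(X, len(X) - 1) <= eps).sum(axis=1) < min_pts

# Four corners of a unit square plus one distant instance (toy values).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [8.0, 8.0]])
rank_kth = np.argsort(-kth_nn_score(X, k=2))
rank_avg = np.argsort(-avg_knn_score(X, k=2))
flags = minpts_eps_anomalies(X, min_pts=2, eps=1.5)
```

All three scores rely on a single global parameter (k or $\epsilon$) and on distances computed against the whole data set.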

For a given x, LOF computes the average density of the k nearest neighbours of x, where the region occupied by these neighbours is the local neighbourhood of x. LOF is computed as the ratio of this average density to the density of x. LOF ≈ 1 indicates that x has a density similar to that of the instances in its local neighbourhood; LOF ≫ 1 indicates that x has a significantly lower density than its local neighbourhood, i.e., its nearest neighbours are far away from x; thus x is more likely to be an anomaly.
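
The distance-based scores described above can be illustrated with a minimal brute-force sketch. It assumes the simple density stated earlier (number of neighbours divided by the radius that reaches the kth neighbour) and a simplified LOF that is the plain ratio of average neighbour density to own density; it is not the reachability-based formulation of Breunig et al. [11]. The function names and the toy data are hypothetical, and the data are assumed to contain no duplicate points.

```python
import math

def knn_distances(data, i, k):
    """Sorted distances from data[i] to its k nearest neighbours (brute force)."""
    dists = sorted(math.dist(data[i], data[j]) for j in range(len(data)) if j != i)
    return dists[:k]

def knn_indices(data, i, k):
    """Indices of the k nearest neighbours of data[i]."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]

def knn_density(data, i, k):
    """Density as (number of neighbours) / (distance to the kth neighbour)."""
    return k / knn_distances(data, i, k)[-1]

def simple_lof(data, i, k):
    """Average density of the k nearest neighbours divided by the density of data[i]."""
    neighbours = knn_indices(data, i, k)
    average = sum(knn_density(data, j, k) for j in neighbours) / len(neighbours)
    return average / knn_density(data, i, k)

# Toy data (an assumption for illustration): a small dense grid plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (0.2, 0.0), (0.2, 0.1), (2.0, 2.0)]
kth_nn = [knn_distances(points, i, 2)[-1] for i in range(len(points))]
lof_scores = [simple_lof(points, i, 2) for i in range(len(points))]
```

In this toy data the grid points obtain a ratio close to 1, while the distant point obtains a ratio far above 1 and also has the largest kth NN distance.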

As pointed out by Breunig et al. [11], the density computed by the k-nearest neighbour density estimator, while suitable for detecting global anomalies, will fail to detect local anomalies. LOF is a 'correction' of the calculated density that enables the detection of both global and local anomalies.

Table 2 shows the principal steps in LOF and LiNearN used to detect anomalies. Not only does f(x) execute faster than f_kNN(x), but LiNearN can also directly use the computed density to rank instances to detect anomalies, without the additional 'correction' in step 2 required by LOF. This is because f(·) uses the nearest neighbour to define the local neighbourhood, which provides the most localised estimation in a k-nearest neighbour approach. In contrast, f_kNN(·) often requires k ≫ 1 in order to detect clustered anomalies that exist in a data set. The larger k is, the less localised the estimation. See the typical values of k required in the experiments reported in Table 6 in Section 5.1.

Like  LiNearN,  ORCA  requires  two  steps  only  but  the  two  steps  are  coupled  to  prune  the  search  space 
in  order  to  improve  its  time  complexity.  Unlike  LiNearN,  the  density  computed  by  ORCA  does  not  allow  it 
to  detect  local  anomalies. 

4.2.  Clustering

In clustering, DBSCAN [15] employs the ε-neighbourhood density estimator (f_ε) defined in Section 2. DBSCAN requires nearest neighbour searches of the entire data set to find out how many instances are within a distance ε of an instance of interest. A data point, x, is identified as a core point when the number of points within distance ε of x is equal to or greater than a user-defined parameter, MinPts. After all core points have been identified, the clustering process starts with the first core point, which is marked as belonging to the first cluster. The next core point is selected from the unmarked core points within the boundary of the cluster. As more core points are marked, the boundary of the cluster continues to expand. This expansion continues until no further unmarked core points are found within the cluster boundary. This entire process is repeated for the next cluster until there are no more unmarked core points. Each border point, which is within the ε-neighbourhood of a core point but is not itself a core point, is then connected to the nearest cluster.
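
The procedure described above can be sketched in a few lines. This is a minimal brute-force illustration under stated assumptions, not the ELKI or WEKA implementation used in the experiments: the function name `dbscan`, the label -1 for noise, and the rule that a border point joins the cluster of its nearest core point are choices made here for illustration.

```python
import math
from collections import deque

def dbscan(data, eps, min_pts):
    """Return one cluster label per point; -1 marks unassigned (noise) points."""
    n = len(data)
    # Brute-force eps-neighbourhoods; each neighbourhood includes the point itself.
    neigh = [[j for j in range(n) if math.dist(data[i], data[j]) <= eps]
             for i in range(n)]
    core = [len(nb) >= min_pts for nb in neigh]
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not core[start] or labels[start] != -1:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:  # expand the cluster through unmarked core points
            p = queue.popleft()
            for q in neigh[p]:
                if core[q] and labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    # Connect each border point to the cluster of its nearest core point within eps.
    for i in range(n):
        if not core[i]:
            cores = [j for j in neigh[i] if core[j]]
            if cores:
                nearest = min(cores, key=lambda j: math.dist(data[i], data[j]))
                labels[i] = labels[nearest]
    return labels

# Toy data (an assumption): two short chains of points and one isolated point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0),
       (5.0, 0.0), (5.1, 0.0), (5.2, 0.0), (5.3, 0.0), (10.0, 10.0)]
toy_labels = dbscan(toy, eps=0.15, min_pts=3)
```

The all-pairs neighbourhood computation is the O(n²d) step that an indexing scheme tries to avoid.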

Conceptually, the new clustering algorithm is the same as DBSCAN with the following three differences. Firstly, we use the new density estimator f to define the density. Secondly, there are no border points. Finally, the clustering is done on the core regions, not on the individual core points. We call the algorithm LiNearN-Cluster. Table 3 gives a comparison between the point-based definitions of DBSCAN and the region-based definitions of LiNearN-Cluster.

LiNearN-Cluster  employs  Algorithms  1  to  3  specified  in  Section  3.5  to  estimate  the  density  of  each 
hypercube  region.  An  additional  procedure  is  required  to  link  all  connecting  hypercubes,  having  intersections 
of  non-empty  sets,  to  form  a  cluster.  The  procedural  outline  of  this  linking  process  is  given  below: 

①  Identify all core regions, H, which have density > MinDensity (Algorithm 5).

②  Remove all (noise) points in D that do not fall into any of the core regions (Algorithm 6).

③  Assign an id to each core region which has a point x ∈ D (Algorithm 7):

    (a)  Find one previously assigned core region (lines 5-13).

    (b)  If none is found, increment the current id (lines 14-17), and assign the current id to the unassigned core region (line 31).

    (c)  If the core region is previously assigned, produce a link between the current id and the previously assigned id (lines 23-27).

④  Merge core regions which have connecting link ids to form a cluster (Algorithm 7, line 36).

Algorithms 5 to 7 are shown in Appendix C.

Figure 3 illustrates an example linking process as outlined in the above steps for instances x1, x2, x3 and x4. Figure 3(a) shows a set of five core regions. Three core regions (T2, T3, T5) covering x1 are identified (Figure 3(b)) and are assigned the current id of 1 (Figure 3(c)). Figure 3(d) shows that x2 is covered by a single core region, T4, and it is assigned the next id of 2. Core regions T1 and T4 cover x3 (Figure 3(e)). Since T4 has been previously assigned an id of 2, T1 will be assigned the same id. The final point x4 falls into core regions T1 and T5 (Figure 3(f)). Now that T1 and T5 have previously assigned ids, they are linked as a pair.

After the core regions covering all instances in D have been assigned ids, all linked pairs are merged into a single cluster with the same id as the final step of the clustering process.
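
The linking process can be sketched with a union-find structure over the assigned ids. This is an illustrative sketch only and not a transcription of Algorithms 5 to 7 in Appendix C: the function name `link_core_regions`, the input format (for each remaining point, the names of the core regions covering it) and the use of union-find for the final merge are assumptions made here.

```python
def link_core_regions(coverage):
    """coverage: for each point, the list of core-region names covering it.
    Returns a dict mapping each core region to its final cluster id."""
    region_id = {}  # id first assigned to each core region
    parent = {}     # union-find forest over ids, recording the links

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    next_id = 0
    for regions in coverage:
        assigned = [region_id[r] for r in regions if r in region_id]
        if assigned:
            current = assigned[0]          # (a) reuse a previously assigned id
        else:
            next_id += 1                   # (b) none found: start a new id
            current = next_id
            parent[current] = current
        for r in regions:
            if r not in region_id:
                region_id[r] = current     # assign the current id
            else:                          # (c) link current id and earlier id
                parent[find(region_id[r])] = find(current)
    # Final step: merge linked ids so that connected regions share one id.
    return {r: find(i) for r, i in region_id.items()}

# The example of Figure 3: x1 is covered by T2, T3, T5; x2 by T4;
# x3 by T1, T4; x4 by T1, T5.
figure3 = link_core_regions([["T2", "T3", "T5"], ["T4"], ["T1", "T4"], ["T1", "T5"]])
```

For the Figure 3 example, the link created by x4 joins ids 1 and 2, so all five core regions end up with the same cluster id.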
Table 3: Point-based definitions for DBSCAN versus region-based definitions for LiNearN-Cluster. The definitions for DBSCAN are extracted from Ester et al. [15] for comparison.

DBSCAN (Point-based)

Definition P1: (ε-neighbourhood of a point) The ε-neighbourhood of a point p, denoted by $N_\epsilon(p)$, is defined by $N_\epsilon(p) = \{q \in D \mid dist(p, q) \le \epsilon\}$.

Definition P2: (directly density-reachable) A point p is directly density-reachable from a point q wrt ε, MinPts if

1.  $p \in N_\epsilon(q)$ and

2.  $|N_\epsilon(q)| \ge MinPts$ (core point condition).

Definition P3: (density-reachable) A point p is density-reachable from a point q wrt ε and MinPts if there is a chain of points $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$.

Definition P4: (density-connected) A point p is density-connected to a point q wrt ε and MinPts if there is a point o such that both p and q are density-reachable from o wrt ε and MinPts.

Definition P5: (cluster) Let D be a database of points. A cluster C wrt ε and MinPts is a non-empty subset of D satisfying the following conditions:

1.  $\forall p, q$: if $p \in C$ and q is density-reachable from p wrt ε and MinPts, then $q \in C$ (Maximality).

2.  $\forall p, q \in C$: p is density-connected to q wrt ε and MinPts (Connectivity).

Definition P6: (noise) Let $C_1, \ldots, C_k$ be the clusters of the database D wrt parameters $\epsilon_i$ and $MinPts_i$, $i = 1, \ldots, k$. Then we define the noise as the set of points in the database D not belonging to any cluster $C_i$, i.e. $noise = \{p \in D \mid \forall i : p \notin C_i\}$.

LiNearN-Cluster (Region-based)

Definition S1: $T(x)$ is a core region of point x wrt MinDensity if $p(x \mid H, \mathcal{D}) > MinDensity$.

Definition S2: $T_r(\cdot)$ is density-connected to $T_s(\cdot)$ wrt MinDensity if there is a chain of regions $T_1(\cdot), \ldots, T_g(\cdot)$ where $r = 1$ and $s = g$ such that $T_i(\cdot) \cap T_{i+1}(\cdot) \ne \emptyset$ and $T_i(\cdot)$ is a core region for all i wrt MinDensity.

Definition S3: An arbitrary-shape cluster C wrt MinDensity is a non-empty subset of a database D satisfying the following condition: $\forall r, s;\ T_r(\cdot), T_s(\cdot) \subseteq C$: $T_r(\cdot)$ is density-connected to $T_s(\cdot)$ wrt MinDensity.

Definition S4: Let $C_1, \ldots, C_u$ be the clusters of D wrt MinDensity. Noise is the set of points in D not belonging to any cluster $C_n$, i.e., $noise = \{x \in D \mid x \notin \bigcup_n C_n\}$.
(c)  The identified core regions (T2, T3 and T5), that cover x1, are assigned a new id.

(d)  Identify core region (T4) covering x2 and assign a new id.

(e)  Identify core region (T1) covering x3 and assign an existing id.

(f)  Identify core regions (T1 and T5) covering x4 and create a link between the two core regions which already have assigned ids.

Figure 3:  An example of the LiNearN-Cluster linking process.
Table 4: A comparison of time and space complexities.

                   Time complexity    Space complexity
LiNearN            O(ntψd)*           O(tψd)
DOLPHIN            O(n²d)             O(nd)
                   O((k/p)nd)†        O((k/p)d)†
ORCA               O(n log n · d)     O(nd)
LOF                O(n²d)             O(nd)
DBSCAN             O(n²d)             O(nd)
LiNearN-Cluster    O(ntψd)            O(tψd)

*  O(t(ψ + Ψ)ψd) is the time complexity required in 'training', i.e., defining the local regions using $\mathcal{D}$ and estimating the density of the local regions using D, as described in the training steps in Section 3.5. O(ntψd) is the time complexity in 'testing', i.e., estimating every x ∈ D, as described in the testing step. Since ψ + Ψ is constant and significantly less than n, the training term is omitted in the Big-O notation.

†  Under a special condition: p is the probability of randomly picking a point from the data set which is a neighbour of the point under consideration using a search index; k is the number of nearest neighbours; d is the number of dimensions.


4.3.  Time and Space Complexities for nearest neighbour-based anomaly detection and clustering algorithms

This section compares the time and space complexities of three state-of-the-art anomaly detection algorithms [3, 11] and the state-of-the-art clustering algorithm [15] with LiNearN. All of these algorithms use density as the basis of their operations. Table 4 lists the complexities for the four algorithms, LiNearN and LiNearN-Cluster.

In nearest neighbour algorithms, the most computationally expensive part of the process is the nearest neighbour search, which has O(n²) time complexity. These algorithms already use, or could use, some indexing scheme to reduce their time complexities.

LiNearN has a significant advantage over the three k-nearest-neighbour-based anomaly detection algorithms in terms of both time and space complexities. This is mainly because LiNearN only needs a small subsample to identify local neighbourhoods, where ψ ≪ n; ψ (also t) can be fixed in practice, regardless of the size of the given training set.

LiNearN-Cluster requires steps for the clustering process in addition to those in LiNearN. Even with these additional steps, the time and space complexities are still the same as those of the basic LiNearN.

Another distinguishing feature is that the space complexities of both LiNearN and LiNearN-Cluster are constant, independent of the data size. These properties make LiNearN and LiNearN-Cluster ideal candidates for domains with huge data sizes or infinite data such as data streams.

5.  Empirical  Evaluations 

All  evaluations  were  conducted  in  the  unsupervised  learning  settings  in  both  anomaly  detection  and 
clustering  tasks. 

The experiments were conducted as single-thread jobs processed at 2.27 GHz in a Linux cluster with 40 GB of memory unless a different memory requirement is specified. Both the LiNearN and LiNearN-Cluster algorithms were written in Java in the WEKA platform [20], as was DBSCAN. LOF and DBSCAN were written in Java in the ELKI platform version 0.5 [1]. ORCA was written in C++ (www.stephenbay.net/orca/).

In anomaly detection, we compare LOF and ORCA with LiNearN using the same eleven data sets as used by Liu et al. [24]. The Ring-Curve-Wave-TriGaussian, OneBig and Pendigits data sets used by Ting et al. [33] are employed to compare LiNearN-Cluster with DBSCAN. The additional Animals, Segment, WDBC, Iris and Yeast data sets are from the UCI Machine Learning Repository [18]. The characteristics of the data sets are shown in Table 5. They were selected to cover different data sizes, numbers of dimensions and numbers of clusters.
Table 5: Data sets. The first eleven data sets are for anomaly detection; the last eight data sets are for clustering.

Data                          Size n               d     anomaly class
Http                          567,497              3     attack (0.4%)
ForestCover (FC)              286,048              10    class 4 (0.9%) vs. class 2
Mulcross                      262,144              4     2 clusters (1%)
Smtp                          95,156               3     attack (0.03%)
Shuttle                       49,097               9     classes 2, 3, 5, 6, 7 (7%)
Mammography                   11,183               6     class 1 (2%)
Satellite                     6,435                36    3 smallest classes (32%)
Pima                          768                  8     pos (35%)
Breastw                       683                  9     malignant (35%)
Arrhythmia                    452                  274   classes 03, 04, 05, 07, 08, 09, 14, 15 (15%)
Ionosphere                    351                  32    bad (36%)
Animals                       200,000              72    4 clusters
OneBig                        68,000               20    9 clusters and 10,000 noise instances
Pendigits                     10,992               16    10 clusters
Segment                       2,310                19    19 clusters
Yeast                         1,484                8     10 clusters
WDBC                          569                  30    2 clusters
Iris                          150                  4     3 clusters
Ring-Curve-Wave-TriGaussian   7,000 to 10 million  48    7 clusters in 6 relevant attributes; 42 irrelevant attributes

In anomaly detection tasks, we conduct a parameter search for all three algorithms and report the best result. For LOF and ORCA, the best k is searched between 5 and 4000 (or up to the data size for small data sets), over the values 5, 10, 20, 40, 60, 80, 150, 250, 300, 500, 1000, 2000, 3000 and 4000. For LiNearN, the best ψ is searched over 2, 4, 8, 16, 32, 64 and 128; and we also report the result of the default setting (i.e., ψ = 2). The other default settings are fixed: t = 1000 and Ψ = 256 for LiNearN; and p = 2 for both LOF and ORCA. The remaining parameter settings for ORCA are set to their default values, except for the number of returned anomaly points, which is set to twice the number of anomaly points in the selected data set but capped at 50% of the total data size. This parameter setting not only provides information which is usually not available in practice but also restricts the distance calculations to those needed for the top anomaly points only. This setting gives ORCA an unfair advantage over LiNearN, which computes the density for every point in the data set.

We report the results of the anomaly detection tasks in terms of CPU runtime and AUC (Area Under the ROC Curve) based on the ranked result. The clustering results are reported in terms of CPU runtime, number of clusters identified, number of unassigned instances, and F-measure, which was calculated based on assigned instances only. F-measure = 1 when all assigned instances are in the correct clusters, i.e., perfect clustering; and F-measure = 0 if all instances are assigned to wrong clusters.
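
As an illustration of how AUC can be obtained from a ranked result, the following minimal sketch computes it as the fraction of (anomaly, normal) pairs in which the anomaly receives the higher anomaly score, with ties counted as one half. The function name and the toy values are hypothetical; for a density-based ranking such as that of LiNearN, the negated density can serve as the anomaly score, since instances are ranked by ascending density.

```python
def auc_from_scores(scores, is_anomaly):
    """AUC as the probability that a randomly chosen anomaly is scored above
    a randomly chosen normal point (ties count one half)."""
    pos = [s for s, a in zip(scores, is_anomaly) if a]
    neg = [s for s, a in zip(scores, is_anomaly) if not a]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal point")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy densities (an assumption): the two anomalies have the lowest densities.
densities = [5.0, 4.0, 0.5, 6.0, 0.2]
labels = [False, False, True, False, True]
perfect = auc_from_scores([-d for d in densities], labels)

# A ranking with one misordered pair out of four.
partial = auc_from_scores([0.9, 0.1, 0.8, 0.3], [True, False, False, True])
```

The pairwise loop is quadratic in the number of instances; it is kept here for clarity rather than efficiency.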

We  report  the  results  in  anomaly  detection  in  Section  5.1,  clustering  in  Section  5.2  and  a  summary  is 
provided  in  Section  5.3. 

5.1.  Anomaly  Detection 

The  ELKI  platform  has  an  option  to  use  an  indexing  scheme  when  using  LOF  to  speed  up  the  nearest 
neighbour  search  process.  We  used  the  R*-Tree  indexing  [8]  for  all  of  the  data  sets  except  http.  LOF  was 
unable  to  process  http  within  the  physical  memory  of  40GB;  therefore,  we  reported  the  results  without  using 
indexing  for  this  data  set.  We  used  the  default  settings  for  indexing  except  in  the  arrhythmia  data  set.  The 
default  page  size  was  changed  from  4KB  to  32KB  because  of  the  high  dimensionality  in  the  arrhythmia  data 
Table 6: AUC results for LiNearN, LOF and ORCA.

              AUC                               Best Parameter
              LiNearN           LOF     ORCA    LiNearN   LOF     ORCA
              ψ = 2    best ψ                   ψ         k       k
http          1.00     1.00     1.00    1.00    2         500     3000
FC            0.81     0.95     0.94    0.88    8         1000    3000
mulcross      1.00     1.00     1.00    1.00    2         2000    3000
smtp          0.78     0.95     0.95    0.74    32        1000    40
shuttle       0.99     0.99     0.98    0.99    2         4000    4000
mammography   0.82     0.87     0.68    0.67    16        4000    80
satellite     0.69     0.72     0.79    0.78    4         1000    500
pima          0.72     0.74     0.72    0.73    4         250     150
breastw       0.98     0.98     0.96    1.00    2         300     300
arrhythmia    0.67     0.71     0.80    0.80    128       80      150
ionosphere    0.94     0.96     0.90    0.93    4         80      5


Table  7:  Runtime  results  (in  seconds)  for  LiNearN,  LOF  and  ORCA.  Separate  training  and  testing  results 
are  shown  for  LiNearN. 


LiNearN 


Training  Testing 


LOF 


ORCA 


http 
FC 

mulcross 
smtp 
shuttle 

mammography 
satellite 
pima 
breastw 
arrhythmia 
ionosphere  0.24 


ip  =  2  best  ip 

71.4 

444.5 
33.2 

556.5 

7.8 

8.9 
3.4 
0.21 

0.24  0.10  0.15 

13.03  1.27  28.60 

0.23  0.11  0.13 


19,965 

78,931 

2,918 

94, 336 

2,169 

56,372 

373 

125 

656 

16,137 

127 

4.0 

23.60 

56.1 

0.44 

0.46 

0.44 

1.02 

1.18 

0.45 

0.26 

0.03 

ip  =  2  best  ip 

0.27  0.27  71.4 

0.25  0.29  57.6 

0.29  0.29  33.2 

0.28  0.67  12.6 

0.34  0.34  7.8 

0.29  0.34  1.1 

0.30  0.27  1.3 

0.20  0.21  0.21 

0.23 
0.96 


set.  It  should  be  pointed  out  that  LOF  within  the  ELKI  platform  does  a  pre-computation  of  every  pair-wise 
distance  before  commencing  the  algorithm  as  a  speedup  technique. 

Table 6 shows the AUC results. It is interesting to note that LiNearN with the default setting ψ = 2 has competitive AUC results in seven out of the eleven data sets. With a parameter search, all three algorithms, LiNearN, LOF and ORCA, produce similar AUC results. However, both LOF and ORCA require a much wider range of parameter search than LiNearN in order to achieve good AUC results. In contrast, LiNearN achieves good AUC performance using small ψ values. Note that the smaller ψ is, the faster LiNearN runs; this applies to LOF and ORCA too for the k parameter. Both the range of search and the actual value required put LiNearN in a more favourable position than LOF and ORCA.

In  addition,  LiNearN  also  runs  significantly  faster  than  LOF  and  ORCA,  in  the  large  data  sets.  In  the 
largest  data  set,  http,  LiNearN  is  faster  than  LOF  and  ORCA  by  a  factor  of  279  and  1101,  respectively. 
The  actual  runtime  results  are  shown  in  Table  7. 
5.1.1.  Scaleup tests

We  conducted  two  scaleup  tests.  We  scaled  up  the  data  size  in  the  first  and  the  number  of  dimensions 
in  the  second. 

The results of the first scaleup test using Mulcross are shown in Figure 4. The k parameter for LOF is set to 2,000 and the ψ parameter for LiNearN is set to 2. The left graph of Figure 4 shows the result when scaling up by data size. When the data size was increased by a factor of 2, 4, 10 and 40, the runtime of LiNearN increased by a factor of 2.2, 4.5, 14 and 61, respectively. In contrast, the corresponding runtime increases for LOF without indexing are 4.2, 17, 105 and 1700 (the last one is a projected value), and those for LOF with indexing are 1.9, 5, 15 and 88. With ten million instances (the last point of each line), LiNearN completed in just under 31 minutes, whereas LOF with indexing took 1.7 hours. Without indexing, LOF is projected to take 100 days. These results are consistent with the time complexities stated in Table 4.
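
One way to check this consistency is to convert each pair of ratios into an empirical scaling exponent b, where runtime ratio = (data size ratio)^b. The sketch below uses only the ratios reported above; the function name and the dictionary layout are hypothetical.

```python
import math

def scaling_exponent(size_ratio, runtime_ratio):
    """Exponent b such that runtime_ratio == size_ratio ** b."""
    return math.log(runtime_ratio) / math.log(size_ratio)

size_ratios = [2, 4, 10, 40]
runtime_ratios = {
    "LiNearN": [2.2, 4.5, 14, 61],
    "LOF without indexing": [4.2, 17, 105, 1700],  # last value is projected
    "LOF with indexing": [1.9, 5, 15, 88],
}
exponents = {
    name: [scaling_exponent(s, r) for s, r in zip(size_ratios, ratios)]
    for name, ratios in runtime_ratios.items()
}
for name, values in exponents.items():
    print(name, [round(v, 2) for v in values])
```

The exponents stay close to 1 for LiNearN (near-linear growth in n) and close to 2 for LOF without indexing (quadratic growth in n), while LOF with indexing lies between roughly 0.9 and 1.2 on this data set.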

The right graph of Figure 4 shows the result of the same experiment with the number of dimensions increased from 4 to 5. Note that the k value for LOF had to be reduced from 2000 to 10 in order to run within 200GB of physical memory, whereas the memory usage is under 40GB for all points in the left graph. The reason for the high memory usage is twofold: ELKI does the entire indexing in memory and, before running the LOF algorithm, it pre-computes all the pair-wise distances needed for a given k value. The contrast between the left and right graphs in Figure 4 shows that LOF's runtime and runtime ratio increase significantly when the number of dimensions is increased from 4 to 5 only.


[Figure 4 panels: left, Mulcross: 4 Dimensions (k=2,000); right, Mulcross: 5 Dimensions (k=10). Horizontal axis: Data Size Ratio.]

Figure 4: Scaleup test for LiNearN and LOF as data size increases from 250,000 to ten million. By using 250,000 as the base, these increases have the ratios of 1 to 40. The final point for LOF without indexing is a projected value in the left graph; and the final point for LOF without indexing, in the right graph, is not available because it took longer than 100 days.

Figure 5 shows the runtime result when scaling up by increasing the number of dimensions from 5 by factors of 2, 4, 10 and 20. For LiNearN, the increases in the runtime are 1.3, 1.3, 1.6 and 1.9, respectively; for LOF without indexing, the increases are 1.6, 2.0, 2.5 and 4.0; with indexing, however, LOF's increases are 5.1, 15 and 25 for the first three points. The last point could not be produced because the process ran out of memory. For the second last point, which has 50 dimensions, LiNearN completed within 134 seconds and LOF without indexing completed in over 6 hours, whereas LOF with indexing took nearly 18.3 hours to complete. This highlights that such an indexing scheme has significant overheads for problems with a moderate number of dimensions: LOF with indexing runs slower than LOF without it, in addition to having high memory demands.

This  experiment  shows  that  while  an  indexing  scheme  such  as  R*-Tree  can  speed  up  the  nearest  neighbour 
search  significantly,  it  has  high  memory  requirements.  Also,  this  speedup  only  occurs  in  problems  with  few 
dimensions.  With  tens  of  dimensions1,  it  is  better  not  to  use  indexing  in  the  Mulcross  data  set. 


1These  are  not  high  dimensional  problems  with  tens  of  thousands  of  dimensions.  LiNearN  and  LOF  are  not  designed  to 
Figure 5: Scaleup test for LiNearN and LOF as the number of dimensions increases from 5 to 100. By using 5 as the base, these increases have ratios of 1 to 20. The final point for LOF with indexing is not available because the memory is not sufficient to run the experiment.


5.1.2.  Sensitivity  test 

The aim of this subsection is to examine how sensitive the performance of LiNearN, in terms of AUC and runtime, is to its parameters. Because the results are similar, only the results of the two largest data sets are shown. Figure 6 shows AUC and runtime as Ψ was increased from 10 to 1000. It is interesting to note that only a small Ψ is required to achieve a good AUC result, and a large value gives minor or no improvement in AUC.
Figure 6: AUC and runtime as a result of varying Ψ of LiNearN. The default settings are used for the other parameters: ψ = 2 and t = 1000.


Figure 7 shows the AUC and runtime results as t was increased from 10 to 1000. It also shows that only a small t of 100 is required to achieve a good AUC result.

Note that the runtime increase between t = 100 and t = 1000 was due to memory swapping between cache and main memory or disk, not to the algorithmic procedure, which is linear in t.

Figure 8 shows the results as ψ was increased from 2 to 128 by doubling it at each step. Among the three parameters, ψ has the highest influence on the AUC result because it controls the smoothness of the density distribution estimated, as shown in Figure 2. It is interesting to note that small ψ values achieve good AUC results, and this allows LiNearN to detect anomalies fast.


deal  with  high  dimensional  problems. 
Figure 7: AUC and runtime as a result of varying t of LiNearN. The default settings are used for other parameters: ψ = 2 and Ψ = 256.


Figure 8: AUC and runtime as a result of varying ψ of LiNearN. The default settings are used for other parameters: t = 1000 and Ψ = 256.

5.1.3.  Identifying  Local  Anomalies 

With LiNearN, the probability of selecting an anomaly into a subsample is significantly smaller than that of selecting a normal point. Thus, only a small number of the t subsamples will include anomalies.

Let e be the probability of selecting an anomaly into a subsample, and let t be the number of subsamples (or models generated from subsamples). Thus, there are (1 − e) × t models built without anomalies and they will all have zero density estimation for anomalies. Only e × t models built from subsamples containing anomalies may have low density estimations for anomalies2. As a result, the densities estimated by LiNearN for anomalies will be low.

On the other hand, all t models will be built from subsamples containing mostly, if not only, normal points. Thus, f(·) will estimate the densities for normal points to be higher than those for anomalies.

Note that the above effect is the same regardless of whether the point is a local or a global anomaly.
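
To give a feel for the magnitudes involved, the following sketch evaluates (1 − e) × t under an additional assumption that is not stated in the text: each subsample consists of ψ points drawn independently, and each drawn point is an anomaly with probability p (the anomaly proportion of the data set), so that e = 1 − (1 − p)^ψ. The function name is hypothetical; the example values p = 0.004, ψ = 2 and t = 1000 correspond to the Http data set and the default settings.

```python
def anomaly_free_models(p, psi, t):
    """Return (e, expected number of the t models built without any anomaly),
    assuming psi independent draws per subsample with anomaly proportion p."""
    e = 1.0 - (1.0 - p) ** psi  # chance that a subsample holds at least one anomaly
    return e, (1.0 - e) * t

e, clean = anomaly_free_models(p=0.004, psi=2, t=1000)
print(f"e = {e:.4f}, anomaly-free models = {clean:.0f} of 1000")
```

Under this assumption about 992 of the 1000 models are built without any anomaly, which is in line with the argument that nearly all models give anomalies a zero density estimate.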

It is interesting to note that the k-nearest neighbour density estimator is known to have problems in identifying local anomalies [11]. An example similar to the one provided by Breunig et al. [11] is given in Figure 9, where the local anomalies are estimated to have the same density as the sparse cluster; thus, they would not be identified as anomalies by a k-nearest neighbour-based anomaly detector such as ORCA, as shown in Figure 9(c).

2These models could still produce zero density estimation for anomalies when these anomalies are significantly different from those that appeared in the training subsamples.
(a)  Sample data: anomalies are marked with bold dots. There are three local anomalies which are located close to the two dense clusters.

(b)  LiNearN where ψ = 16. AUC = 1.

(c)  kNN where k = 16. AUC = 0.96.

(d)  LOF where k = 16. AUC = 1.


Figure 9: Examining the ability to detect local anomalies using LiNearN, kNN and LOF. The 'inverse' anomaly scores are shown for LiNearN and kNN. Low scores indicate normal points, whereas higher scores indicate anomalies. To highlight the difference between anomalies and normal points, the scores shown are the result of subtracting the maximum score of the normal points, such that all normal points have (adjusted) scores below or equal to zero. Anomalies are shown as black lines and normal points as grey lines.


Both  LiNearN  and  LOF  can  detect  all  anomalies  as  shown  in  Figure  9(b)  and  9(d). 

In contrast to the k-nearest neighbour density estimator, LiNearN, which is based on the nearest neighbour, can easily detect local anomalies for the reason stated above. Instead of computing relative density, as in LOF [11], to 'correct' the deficiency of the k-nearest neighbour procedure in identifying local anomalies, LiNearN employs sampling to reduce the anomalies' presence in the training samples, as the key step to identify both local and global anomalies.

5.2.  Clustering

Table 8 shows the clustering results for LiNearN-Cluster and DBSCAN. In the OneBig data set, which has nine clusters with 10000 noise points, LiNearN-Cluster produced ten clusters, including one cluster consisting of noise, with about 3000 additional points not assigned to any cluster. DBSCAN produced nine clusters with slightly more than ten thousand unassigned points. Both algorithms produced F-measures equal to one for all nine clusters. In terms of runtime, LiNearN-Cluster and DBSCAN took about the same time to complete this task.

In  the  pendigits  data  set,  LiNearN-Cluster  produced  a  better  clustering  result  than  DBSCAN  in  terms 
of  the  number  of  clusters  and  the  number  of  unassigned  instances.  Because  LiNearN-Cluster  assigned  about 
Table 8: Clustering results for the OneBig, Pendigits and Animals data sets using LiNearN-Cluster (ψ = 600 for OneBig; ψ = 512 for Pendigits; ψ = 43 and t = 30000 for Animals. Other default settings are Ψ = 256 and MinDensity = 6) and DBSCAN (ε = 0.1 for OneBig; ε = 0.2 for Pendigits; ε = 0.7 for Animals and MinPts = 6).

               OneBig                      Pendigits                   Animals
               LiNearN-Cluster   DBSCAN    LiNearN-Cluster   DBSCAN    LiNearN-Cluster   DBSCAN
Runtime        8565              8406      821               181       38532             240187
#cluster       [9] 10            9         [10] 18           65        [4] 4             4
#unassigned    2920              10005     1495              6251      18597             3347
F-measure      1.00              1.00      0.68              0.75      1.00              1.00


5000  more  instances  to  clusters,  it  has  slightly  lower  F-measure  than  DBSCAN.  In  this  relatively  small  data 
set,  DBSCAN  ran  faster  than  LiNearN-Cluster. 

Animals is an interesting synthetic data set because of how it is constructed. It has approximately 16000 mini-clusters, each of which has either 12 or 13 points. These mini-clusters are grouped into 4 clusters, 2 of which are in close proximity to each other. With ε = 0.3, DBSCAN produced the 16000 mini-clusters. In order for DBSCAN to detect the 4 clusters, ε needed to be set to 0.7. For data with these characteristics, LiNearN requires a high t in order to link the related mini-clusters into a cluster. Using ψ = 43 and t = 30000, LiNearN produced the correct 4 clusters but left approximately 1500 mini-clusters unassigned.

Table 9 shows the results for two smaller data sets, Iris and Yeast. For Iris, LiNearN-Cluster produced
a better result in terms of unassigned points and the number of clusters, whereas DBSCAN left almost
half of the data points unassigned. Both LiNearN-Cluster and DBSCAN have similar F-measure scores.
For Yeast, DBSCAN left almost all of the data points unassigned, whereas LiNearN-Cluster left about
one-third unassigned. In terms of F-measure, LiNearN-Cluster outperformed DBSCAN. LiNearN-Cluster
has a longer runtime than DBSCAN on these small data sets. Table 10 shows similar results for the Segment and
WDBC data sets.

None of the above results reveals the time complexities of the algorithms. Therefore, we conducted a
scaleup test using the same data set as used by Ting et al. [33], i.e., Ring-Curve-Wave-TriGaussian. The
characteristics of this data set are shown in Appendix D.

We used DBSCAN with and without indexing using R*-Tree in the ELKI platform in this experiment.
The scaleup result is shown in Figure 10. Using 7000 instances as the base, the data size was increased by
a factor of 10, 75 and 150 to reach one million instances. With these data size ratios, LiNearN-Cluster's
runtime increased by a factor of 12.5, 93.6 and 206, respectively. In contrast, without
indexing, DBSCAN's runtime increased by a factor of 112, 6676 and 25754, respectively; with
indexing, the factors were 88, 6080 and 20980, respectively. At data size ratio = 150 with one million instances,
DBSCAN without indexing completed the task in 98 hours, whereas LiNearN-Cluster finished in 26 minutes.
DBSCAN with indexing completed in 77 hours, which is an improvement of 21.5% over DBSCAN without
indexing. This result shows the advantage of LiNearN-Cluster over DBSCAN, and that indexing does not
improve DBSCAN's runtime significantly for problems with a moderate number of dimensions.
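
As an illustration only (it is not part of the reported experiments), the ratios quoted above can be converted
into empirical scaling exponents. The sketch below assumes that runtime grows as a power of the data size,
so that runtime ratio = (size ratio)^b; the function name scaling_exponent and the dictionary labels are our
own. Under this assumption, the exponents for LiNearN-Cluster come out close to 1, and those for DBSCAN
(with or without indexing) come out close to 2.

```python
import math

# Data-size ratios and runtime ratios as quoted in the scaleup test above.
size_ratios = [10, 75, 150]
runtime_ratios = {
    "LiNearN-Cluster": [12.5, 93.6, 206],
    "DBSCAN (no index)": [112, 6676, 25754],
    "DBSCAN (R*-Tree)": [88, 6080, 20980],
}


def scaling_exponent(size_ratio, runtime_ratio):
    """Return b such that runtime_ratio == size_ratio ** b (power-law assumption)."""
    return math.log(runtime_ratio) / math.log(size_ratio)


exponents = {
    name: [scaling_exponent(s, r) for s, r in zip(size_ratios, ratios)]
    for name, ratios in runtime_ratios.items()
}

for name, values in exponents.items():
    print(name, [round(v, 2) for v in values])
```

An exponent near 1 is what a linear-time method would show, and an exponent near 2 is what a quadratic-time
method would show; three data points cannot establish a complexity class, so this is only a consistency check.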

5.3.  Summary 

While the proposed approach also relies on distance calculations to find the nearest neighbour, like all existing
nearest neighbour algorithms, there are important differences. First, the proposed approach reduces the
computationally expensive $O(n^2)$ nearest neighbour search process to an $O(n)$ search process which involves
a small subset of size $\psi \ll n$. This is demonstrated in both anomaly detection and clustering tasks.
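
The difference in the number of distance calculations can be sketched as follows. This is a minimal
illustration, not the LiNearN implementation: the values of n, psi and d are arbitrary, the count n(n - 1)
assumes a naive search without indexing, and the helper names are hypothetical. In an ensemble, the
subsample search is repeated once per model, which multiplies the count by a constant that does not depend
on n.

```python
import random


def count_full_search(n):
    """Distance evaluations when every point is compared with every other point."""
    return n * (n - 1)


def nearest_in_subsample(points, subsample):
    """Find each point's nearest subsample member and count distance evaluations."""
    evaluations = 0
    nearest = []
    for x in points:
        best_idx, best_dist = None, float("inf")
        for idx, c in enumerate(subsample):
            evaluations += 1
            dist = sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        nearest.append(best_idx)
    return nearest, evaluations


rng = random.Random(0)
n, psi, d = 2000, 16, 2  # illustrative sizes only
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
subsample = rng.sample(points, psi)

nearest, evaluations = nearest_in_subsample(points, subsample)
print("full pairwise search:", count_full_search(n))
print("subsample search    :", evaluations)
```

The subsample search costs exactly n * psi evaluations, so doubling n doubles the cost, whereas the naive
pairwise count roughly quadruples.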

Second, as shown by the results in Table 6, existing $k$-nearest neighbour anomaly detectors require a
significant amount of search to find an appropriate $k$ in order to produce good results; this adds to their
already heavy computational cost. LiNearN's ability to use low $\psi$ for anomaly detection has the following
advantages over existing $k$-nearest neighbour anomaly detectors:

•  Significantly lower runtime.

•  A search is required in a small range of $\psi$ values only.

•  Indexing schemes, normally required to speed up nearest neighbour search, become unnecessary.

Table 9: Clustering results for the Iris and Yeast data sets using LiNearN-Cluster ($\psi = 32$ for Iris;
$\psi = 160$ for Yeast. Other default settings are T = 256 and MinDensity = 6) and DBSCAN ($\epsilon = 0.1$
for Iris; $\epsilon = 0.07$ for Yeast; and MinPts = 5).

               Iris                        Yeast
               LiNearN-Cluster  DBSCAN     LiNearN-Cluster  DBSCAN
Runtime        0.51             0.1        21.3             1.8
#cluster       [3] 4            5          [10] 15          12
#unassigned    16               70         572              1197
F-measure      0.90             0.89       0.33             0.20

Table 10: Clustering results for the Segment and WDBC data sets using LiNearN-Cluster ($\psi = 400$ and
T = 1024 for Segment; $\psi = 30$ and T = 512 for WDBC. The default setting is MinDensity = 6) and
DBSCAN ($\epsilon = 0.1$ for Segment; $\epsilon = 0.3$ for WDBC; and MinPts = 6).

               Segment                     WDBC
               LiNearN-Cluster  DBSCAN     LiNearN-Cluster  DBSCAN
Runtime        50               7          1.5              1.3
#cluster       [19] 40          43         [2] 2            2
#unassigned    417              1282       319              331
F-measure      0.60             0.62       0.96             0.89

Figure 10: Scaleup test for LiNearN-Cluster and DBSCAN using the 48-dimensional Ring-Curve-Wave-
TriGaussian data set. 7000 instances are used as the base for data size ratio. The data size is increased by
a factor of 10, 75 and 150, the last of which has one million instances.

In clustering, both LiNearN-Cluster and DBSCAN require a search of the key parameters, $\psi$ and $\epsilon$,
respectively, in order to identify connecting local regions to achieve good results. Nevertheless,
LiNearN-Cluster has more points assigned than DBSCAN in six out of the seven data sets, with comparable
results in terms of F-measure and the number of clusters.

6.  Discussion 

The nature of the nearest neighbour approach necessitates $O(n^2)$ distance calculations in order to find the
nearest neighbours in a data set of size $n$. This paper shows that if the aim is to do density estimation,
then $O(n^2)$ distance calculations are not required, even though a nearest neighbour approach is adopted. The
proposed approach opens up a new opportunity for many tasks in which nearest neighbour algorithms
have been applied. The aims of these tasks need to be carefully examined to determine whether $O(n^2)$
distance calculations are necessary. If they are not, then the proposed approach is a way to convert an
$O(n^2)$ algorithm to an $O(n)$ algorithm. The current research focus on indexing is suitably applied only to
tasks in which $O(n^2)$ distance calculations are absolutely necessary.

Indeed, we show in anomaly detection tasks that, while the ranking requires density estimation, it
requires neither $O(n^2)$ distance calculations nor the correction suggested by LOF to enable density to be used
directly to detect local anomalies. In clustering tasks, the aim is to identify core regions and link all
neighbouring core regions to form a cluster. We show that this aim can also be achieved without $O(n^2)$
distance calculations.

It is interesting to identify the steps in the process where the speedup was achieved by LiNearN. In
anomaly detection tasks, density must be estimated for every point in the given data set. The density
for a point is estimated from the $t$ hypercubes covering the point, rather than by invoking $n - 1$ distance
calculations. In clustering tasks, the speedup occurs in two steps. First, density needs to be estimated for
local regions only in LiNearN-Cluster. In contrast, density must be computed for every single point in the
given data set in DBSCAN. Second, because the number of local regions is significantly smaller than the
number of points in the data set, the number of links required to form clusters becomes significantly smaller
for LiNearN-Cluster than for DBSCAN.

An ensemble of $k$-nearest neighbours does not usually work because $k$-nearest neighbour classifiers, in
the classification context, are stable learners like SVM and Naive Bayesian classifiers. Most ensemble
approaches, e.g., Bagging [10] and Boosting [19], only work for unstable learners such as decision trees.

Although there are recent ensemble approaches (e.g., Feating [34] and LocalModel [35]) that have been
shown to work for stable learners such as $k$-nearest neighbour, the proposed approach has important
differences. First, Feating and LocalModel are specifically designed for classification tasks only, whereas
LiNearN is a density estimator which has a wider application to different tasks. Second, LiNearN employs
a subsample, which is significantly smaller than the given data set, to build each model in the ensemble. In
contrast, Feating and LocalModel³ build individual models using the entire data set. Third, both Feating
and LocalModel use a tree structure to define local regions; LiNearN uses nearest neighbours to define local
regions. Fourth, in LiNearN, the shape of the local regions can be easily changed by setting the $p$ parameter
in the $L_p$-norm. The shape of the local regions is harder, if not impossible, to change for both Feating and
LocalModel.


³ Note that LocalModel is not an ensemble approach but constructs a global model which consists of many local models.
Boosting can then be used to improve the predictive accuracy of an individual LocalModel.

A feature bagging method [23] has been proposed to build multiple LOF models, each from a random
feature subset, and then aggregate the LOF scores from all models to produce the final score. However, this
method is reported to be marginally better than single LOF in five out of six data sets [23].

There are methods to reduce the number of instances in the given data in order to reduce the search
cost (e.g., [2, 14, 29, 31, 36, 37]). These methods spend a significant amount of time in the instance
reduction process. In contrast, the instance selection process in LiNearN is a random sampling process
which can be achieved very quickly.

In the context of supervised learning, Salzberg [28] proposed a $k$-nearest neighbour algorithm, called
Nested Generalised Exemplars (NGE), which constructs hyper-rectangles to replace and reduce the number
of correctly classified instances while storing incorrectly classified instances like all nearest neighbour
algorithms. The purpose of NGE differs substantially from that of LiNearN. NGE is a classifier, whereas
LiNearN is a density estimator that can be used for various pattern recognition tasks, potentially including
classification. The algorithmic differences are: NGE is a single model and stores both hyper-rectangles and
instances, whereas LiNearN is an ensemble approach which stores a small subset of instances to form
hypercube regions.

DEMass  [33]  is  a  closely  related  work  which  is  a  grid-based  method  that  has  a  global  parameter,  like 
existing  nearest  neighbour  density  estimators,  to  control  the  size  of  the  grid.  Like  all  grid-based  methods, 
the  grid  in  DEMass  has  a  single  size  which  does  not  adapt  to  different  data  distributions  in  local  regions. 
Thus,  LiNearN  is  expected  to  be  more  adaptive  to  data  distribution  in  local  regions  than  DEMass.  On  the 
other  hand,  LiNearN  can  be  expected  to  run  slower  than  DEMass  because  of  the  use  of  distance  measure. 

Like  the  Voronoi  diagram  [6],  LiNearN  divides  the  feature  space  using  the  nearest  neighbour  rule;  but 
they  have  important  differences.  First,  the  union  of  all  regions  in  the  Voronoi  diagram  covers  the  entire 
feature  space,  but  not  in  LiNearN.  Second,  the  Voronoi  diagram  is  a  result  of  all  instances  in  the  given  data 
set;  LiNearN  is  a  density  estimator  constructed  from  multiple  models  using  subsets  of  the  given  data  set. 

A Voronoi region can be defined as:

$$V_c = \{x \in \mathcal{X} \mid \|x - x_c\| < \|x - x_i\| \ \ \forall i \neq c \text{ and } x_i, x_c \in D\}.$$

In contrast, local region $H_c$ in LiNearN can be defined as:

$$H_c = \{x \in \mathcal{X} \mid \|x - x_c\| < r_c \text{ and } x_c \in \mathcal{D} \subset D\}.$$

If $\mathcal{D} = D$, then the bisector between the two nearest neighbours in $\mathcal{D}$ will be the same for $V_c$ and $H_c$.
Even in this case, the volume of $H_c$ is solely determined by the nearest neighbour of $x_c$; whereas, the volume
of $V_c$ is determined by all nearest neighbours of $x_c$ in all directions. In addition, the shape of $H_c$ is solely
due to the $L_p$-norm; but the shape of $V_c$ relies on $x_c$'s nearest neighbours and the $L_p$-norm.
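
The contrast between the two kinds of region can be sketched in a few lines of Python. This is an
illustration under stated assumptions rather than the authors' code: the three centres are made up, the
radius of a region is taken to be the distance from its centre to the nearest other centre, and a point is
tested only against the region of its nearest centre (our reading of the nearest neighbour rule). All
function names are hypothetical.

```python
def lp_distance(a, b, p=2.0):
    """L_p distance between two points given as equal-length tuples."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)


def region_radii(centres, p=2.0):
    """r_c: distance from each centre to its nearest neighbour among the other centres."""
    return [
        min(lp_distance(c, other, p) for j, other in enumerate(centres) if j != i)
        for i, c in enumerate(centres)
    ]


def voronoi_cell(x, centres, p=2.0):
    """Index of the nearest centre; every x in the feature space gets a cell."""
    return min(range(len(centres)), key=lambda i: lp_distance(x, centres[i], p))


def covering_region(x, centres, radii, p=2.0):
    """Index c of the nearest centre if ||x - x_c|| < r_c, otherwise None."""
    c = voronoi_cell(x, centres, p)
    return c if lp_distance(x, centres[c], p) < radii[c] else None


centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]  # made-up subsample
radii = region_radii(centres)
near, far = (0.2, 0.1), (5.0, 5.0)

print("radii:", radii)
print("near :", voronoi_cell(near, centres), covering_region(near, centres, radii))
print("far  :", voronoi_cell(far, centres), covering_region(far, centres, radii))
```

The point far from all centres still belongs to a Voronoi cell but is covered by no local region, which is
the first difference listed above. Changing p changes the shape of every local region at once.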

It  should  be  pointed  out  that  in  small  data  sets  the  proposed  method  runs  slower  than  the  existing 
nearest  neighbour  density  estimator  without  indexing.  This  is  also  true  for  any  indexing  scheme  which  has 
overheads  that  outweigh  its  use  in  small  data  sets.  Indexing  schemes  and  our  proposed  method  are  designed 
to  deal  with  large  data  sets  which  impose  serious  time  and  memory  constraints.  No  such  constraints  exist 
in  small  data  sets. 

7.  Conclusion  and  Future  Work 

By  rejecting  the  premise  that  a  nearest  neighbour  algorithm  must  find  the  nearest  neighbour  for  every 
instance  in  the  given  data  set,  we  propose  a  new  approach  to  produce  a  nearest  neighbour  density  estimator 
called  LiNearN.  It  is  the  first  nearest  neighbour  density  estimator  to  have  linear  time  complexity  and  constant 
space  complexity,  as  far  as  we  know.  In  contrast,  existing  nearest  neighbour  density  estimators  typically 
have $O(n^2)$ time complexity and $O(n)$ space complexity; and even with the aid of an indexing scheme, the
time  complexity  can  at  best  be  reduced  to  near  linear  time  only.  LiNearN  achieves  linear  time  complexity 
without  any  indexing  scheme. 

Our asymptotic analysis reveals that LiNearN has a parameter which trades off between bias and variance,
as in existing $k$-nearest neighbour density estimators.

We assess LiNearN in anomaly detection and clustering tasks and compare it with three state-of-the-art
nearest neighbour algorithms, ORCA, LOF and DBSCAN. LiNearN produces results similar to those of
these algorithms in terms of task-specific performance measures, but it runs orders of magnitude faster than
these algorithms in large data sets.

The advantages of the new nearest neighbour approach shown in these two tasks imply that it can
potentially be adopted, in place of existing nearest neighbour algorithms, to solve other pattern recognition
tasks. In each of these tasks, we shall first examine whether the aim can be achieved without $O(n^2)$ distance
calculations. LiNearN may be able to reduce the time and space complexities of kernel density estimators.
This is an interesting open question that deserves future investigation.

Appendix A - Proof for Theorem 1


The ball of radius $r$ centered at $c \in \mathcal{D}$ is denoted by $B_c(r)$. Let $\omega_c(r)$ denote the probability measure
induced by $\phi(x)$ on the neighbourhoods of $C$,

$$\omega_c(r) = \int_{B_c(r) \cap C} \phi(x)\,dx.$$

$\omega_c(r) > 0$ on $C$ since $B_c(r) \cap C$ is not empty and $\phi(x) > 0$ for all $x \in C$. Because of this fact and the
convexity of $C$, $\omega_c(r)$ is strictly monotonic increasing for $0 < r < r_0$ for some $r_0 > 0$ and $\omega_c(r) = 1$ for
$r_0 < r$. Thus, the inverse function $r = h(\omega_c)$ exists, and the following holds according to the continuity of
$\phi(x)$ on $C$ and Eq. (5.15) with $k = 1$ in Evans et al. [17].

$$E[r_c^m] = (\psi - 1) \int_0^1 h(\omega_c)^m (1 - \omega_c)^{\psi - 2}\,d\omega_c. \qquad \text{(T1.1)}$$

Within the interval of this integral, the integral over $[\omega_c(\delta), 1]$ where $\delta = 1/\psi^{\epsilon}$ is negligible for all $0 < \epsilon < 1/d$
by Lemma 5.3 with $k = 1$ in Evans et al. [17]. With this fact, the convexity of $C$, the continuity and the
bounded partial derivatives of $\phi(x)$ on $C$, we obtain the following according to Eq. (5.22) in Evans et al.
[17],

$$\omega_c(r) = (\phi(c) + O(\delta))\,|B_c(r) \cap C| \quad \text{as } \psi \to \infty, \qquad \text{(T1.2)}$$

where $|B_c(r) \cap C|$ is the volume of $B_c(r) \cap C$.

In case that the shortest distance between $c$ and the surface boundary of $C$ is more than or equal to $\delta$,
$|B_c(r) \cap C| = |B_c(r)| = a(d,p)\,r^d$. Thus, $\omega_c(r) = (\phi(c) + O(\delta))\,a(d,p)\,r^d$ as $\psi \to \infty$ holds by Eq. (T1.2). In
case that $c$ is closer than $\delta$ from the surface boundary of $C$, there exists a constant $0 < \nu' < 1$ such that
$\nu' a(d,p)\,r^d < |B_c(r) \cap C|$ by Eq. (2.1), Proposition (2.1) and Eq. (2.3) in Evans et al. [17]. In concert with
$|B_c(r) \cap C| < a(d,p)\,r^d$, there exists $0 < \nu' < \nu(d,p,c) < 1$ and $|B_c(r) \cap C| = \nu(d,p,c)\,a(d,p)\,r^d$ holds. By
this fact and Eq. (T1.2), $\omega_c(r) = (\phi(c) + O(\delta))\,\nu(d,p,c)\,a(d,p)\,r^d$ as $\psi \to \infty$ holds for some $0 < \nu(d,p,c) < 1$.
Combining these two cases,

$$\omega_c(r) = (\phi(c) + O(\delta))\,\nu(d,p,c)\,a(d,p)\,r^d \quad \text{as } \psi \to \infty \qquad \text{(T1.3)}$$

for some $0 < \nu(d,p,c) \le 1$.

By following the manipulation from Eq. (5.24) to Eq. (5.33) in Evans et al. [17] with $k = 1$, our Eq. (T1.1)
and (T1.3), we obtain

$$E[r_c^m] = \frac{\Gamma(m/d + 1)}{\{\nu(d,p,c)\,a(d,p)\,\phi(c)\}^{m/d}} \cdot \frac{1}{(\psi + 1)^{m/d}}\,(1 + O(\delta)) \quad \text{as } \psi \to \infty.$$

Because $O(1/(\psi + 1)^{m/d})\,O(\delta) = O(1/\psi^{m/d + \epsilon})$, Eq. (7) follows.

□
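
As a numerical sanity check of the leading-order behaviour, the following Python sketch compares a Monte
Carlo estimate of $E[r_c^m]$ with the expression $\Gamma(m/d+1) / \{\nu\,a(d,p)\,\phi(c)\}^{m/d} / (\psi+1)^{m/d}$ as we read it from
the last display of the proof. Everything about the setup is an assumption chosen for simplicity: $d = 1$,
a uniform density $\phi = 1$ on $[0, 1]$, a centre fixed at 0.5 (far from the boundary, so $\nu = 1$), and
$a(1,p) = 2$ because a one-dimensional ball of radius $r$ has length $2r$. The function names are ours.

```python
import math
import random


def mean_nn_distance_power(psi, m, trials, rng):
    """Monte Carlo estimate of E[r^m]: r is the distance from a centre at 0.5
    to the nearest of psi - 1 points drawn uniformly from [0, 1]."""
    total = 0.0
    for _ in range(trials):
        r = min(abs(rng.random() - 0.5) for _ in range(psi - 1))
        total += r ** m
    return total / trials


def leading_order(psi, m, d=1, ball_coeff=2.0, density=1.0, nu=1.0):
    """Gamma(m/d + 1) / (nu * a * phi)^(m/d) / (psi + 1)^(m/d)."""
    scale = (nu * ball_coeff * density) ** (m / d)
    return math.gamma(m / d + 1) / scale / (psi + 1) ** (m / d)


rng = random.Random(1)
psi = 50
estimates = {m: mean_nn_distance_power(psi, m, 20000, rng) for m in (1, 2)}
predictions = {m: leading_order(psi, m) for m in (1, 2)}

for m in (1, 2):
    print(m, estimates[m], predictions[m])
```

For this setup the simulated moments and the leading-order expression should agree to within a few percent;
the remaining gap reflects both sampling noise and the lower-order terms that the $O(\delta)$ factor absorbs.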
Appendix B - Proof for Theorem 2

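Let $C_c$ be a hypercube such that $C_c = \prod_{j=1}^{d} [c_j - r_c,\, c_j + r_c]$ where $c_j$ is the $j$-th element of $c \in C$. Note
that $H_c \subseteq C_c$ for any $p > 0$ of the $L_p$ distance measure. Also, we use $\Pr(H_c)$ to denote $\Pr(x \in H_c \mid x \in \mathfrak{D})$
for brevity. In addition, according to the bounded first and second order partial derivatives of $\phi(x)$ at each
point $x \in C$, we denote their bounds as

$$\left| \frac{\partial \phi(c)}{\partial c_j}\Big|_{c \in H_c} \right| \le B_1(\phi, C) \quad \text{and} \quad \left| \frac{\partial^2 \phi(c)}{\partial c_j^2}\Big|_{c \in H_c} \right| \le B_2(\phi, C). \qquad \text{(T2.1)}$$

Furthermore, the second order Taylor approximation of $\phi(x)$ around the center $c$ of $H_c$ is given as

$$\phi(x)|_{x \in H_c} \simeq \phi(c)|_{c \in H_c} + (x - c)^T \nabla \phi(c)|_{c \in H_c} + \frac{1}{2} \{(x - c)^T \nabla\}^2 \phi(c)|_{c \in H_c}, \qquad \text{(T2.2)}$$

where $\nabla = [\partial/\partial x_1, \ldots, \partial/\partial x_d]^T$.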

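From $H_c \subseteq C_c$, (T2.1), (T2.2) and the fact that the integral of an odd function around $c$ over its
symmetric region is zero, we obtain the following.

$$
\begin{aligned}
\frac{\Pr(H_c)}{|H_c|} &= \frac{1}{|H_c|} \int_{H_c} \phi(y)\,dy \\
&\simeq \phi(c)|_{c \in H_c} + \frac{1}{|H_c|} \int_{H_c} \frac{1}{2} \{(y - c)^T \nabla\}^2 \phi(c)|_{c \in H_c}\,dy \\
&= \phi(c)|_{c \in H_c} + \frac{1}{a(d,p)\,r_c^d} \int_{H_c} \frac{1}{2} \sum_{j=1}^{d} (y_j - c_j)^2\, \frac{\partial^2 \phi(c)}{\partial c_j^2}\Big|_{c \in H_c}\,dy \\
&\le \phi(c)|_{c \in H_c} + \frac{1}{a(d,p)\,r_c^d} \int_{C_c} \frac{1}{2} \sum_{j=1}^{d} (y_j - c_j)^2 \left| \frac{\partial^2 \phi(c)}{\partial c_j^2}\Big|_{c \in H_c} \right| dy \\
&\le \phi(c)|_{c \in H_c} + \frac{d\,2^{d-1} B_2(\phi, C)}{3\,a(d,p)}\, r_c^2. \qquad \text{(T2.3)}
\end{aligned}
$$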


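Because $\zeta(x)$ is computed from two subsamples $\mathcal{D}$ and $\mathfrak{D}$ which are mutually independent, the expectation
$E[\cdot]$ consists of two independent expectations over $\mathcal{D}$ and over $\mathfrak{D}$ denoted as $E_{\mathcal{D}}[\cdot]$ and $E_{\mathfrak{D}}[\cdot]$, respectively.

$|\mathfrak{D}(H_c)|$ under a given $H_c$ follows a binomial distribution $B(\Psi, \Pr(H_c))$ over $\mathfrak{D}$ where its expected value is

$$E_{\mathfrak{D}}[|\mathfrak{D}(H_c)|] = \Psi \Pr(H_c), \qquad \text{(T2.4)}$$

and its variance

$$E_{\mathfrak{D}}[\{|\mathfrak{D}(H_c)| - E_{\mathfrak{D}}[|\mathfrak{D}(H_c)|]\}^2] = \Psi \Pr(H_c)(1 - \Pr(H_c)). \qquad \text{(T2.5)}$$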

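We denote $\mathcal{H}^i = \{H^i_c \mid c \in \mathcal{D}^i\}$ for the $i$-th subsampling in Eq. (5) and $r_{c,i}$ for the radius of $H^i_c$. Then,
from Eq. (3), (5), (T2.1), (T2.2), (T2.3) and (T2.4), the square bias of $\zeta(x)$ is evaluated as follows. Note
that $|x_j - c_j| \le r_{cnn(x),i}$ holds for each dimension $j$ since $x$ is always in $H^i_{cnn(x)}$ for each $i$. So, we apply
this inequality in the last line.

$$
\begin{aligned}
\{E[\zeta(x)] - \phi(x)\}^2
&= \left\{ \frac{1}{t} \sum_{i=1}^{t} E_{\mathcal{D}}\!\left[ \frac{\Pr(H^i_{cnn(x)})}{|H^i_{cnn(x)}|} - \phi(x) \right] \right\}^2 \\
&\simeq \left\{ \frac{1}{t} \sum_{i=1}^{t} E_{\mathcal{D}}\!\left[ \frac{\Pr(H^i_{cnn(x)})}{|H^i_{cnn(x)}|} - \phi(c)|_{c \in H^i_{cnn(x)}} - (x - c)^T \nabla \phi(c)|_{c \in H^i_{cnn(x)}} - \frac{1}{2} \{(x - c)^T \nabla\}^2 \phi(c)|_{c \in H^i_{cnn(x)}} \right] \right\}^2 \\
&\le \left\{ \frac{d\,2^{d-1} B_2(\phi, C)}{3\,a(d,p)}\, E_{\mathcal{D}}[r^2_{cnn(x),i}] + d\,B_1(\phi, C)\, E_{\mathcal{D}}[r_{cnn(x),i}] + \frac{d^2 B_2(\phi, C)}{2}\, E_{\mathcal{D}}[r^2_{cnn(x),i}] \right\}^2. \qquad \text{(T2.6)}
\end{aligned}
$$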


From Eq. (7) with $m = 1$ or 2, the term $E_{\mathcal{D}}[r_{cnn(x),i}]$ dominates in terms of $\psi$ in Eq. (T2.6). Though
this analysis uses the second order approximation of $\phi(x)$, the result using the higher order approximation
is the same, since the first order term dominates the magnitude. This gives Eq. (8).

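Next, from Eq. (3), (5), (T2.1), (T2.2), (T2.3) and (T2.5), the variance of $\zeta(x)$ is evaluated as follows.
Note that Eq. (7) indicates that the magnitudes of $1/|H^i_{cnn(x)}|$ and $E_{\mathcal{D}}[1/|H^i_{cnn(x)}|]$ have the same order for
$\psi$. So, we introduce an approximation to remove $E_{\mathcal{D}}[\cdot]$ in the inner summation on the third line.

$$
\begin{aligned}
E[\{\zeta(x) - E[\zeta(x)]\}^2]
&= E_{\mathcal{D}}\!\left[ E_{\mathfrak{D}}\!\left[ \left\{ \frac{1}{t} \sum_{i=1}^{t} \frac{|\mathfrak{D}(H^i_{cnn(x)})|}{\Psi\,|H^i_{cnn(x)}|} - E_{\mathcal{D}}\!\left[ E_{\mathfrak{D}}\!\left[ \frac{1}{t} \sum_{i=1}^{t} \frac{|\mathfrak{D}(H^i_{cnn(x)})|}{\Psi\,|H^i_{cnn(x)}|} \right] \right] \right\}^2 \right] \right] \\
&\simeq \frac{1}{t^2} \sum_{i=1}^{t} E_{\mathcal{D}}\!\left[ \frac{E_{\mathfrak{D}}[\{|\mathfrak{D}(H^i_{cnn(x)})| - E_{\mathfrak{D}}[|\mathfrak{D}(H^i_{cnn(x)})|]\}^2]}{\Psi^2\,|H^i_{cnn(x)}|^2} \right] \\
&= \frac{1}{t^2 \Psi} \sum_{i=1}^{t} E_{\mathcal{D}}\!\left[ \frac{\Pr(H^i_{cnn(x)})}{|H^i_{cnn(x)}|} \cdot \frac{1 - \Pr(H^i_{cnn(x)})}{|H^i_{cnn(x)}|} \right] \\
&\le \frac{1}{t^2 \Psi} \sum_{i=1}^{t} E_{\mathcal{D}}\!\left[ \frac{d\,2^{d-1} B_2(\phi, C)}{3\,a(d,p)}\, r^2_{cnn(x),i} \left( \frac{1}{a(d,p)\,r^d_{cnn(x),i}} + \frac{d\,2^{d-1} B_2(\phi, C)}{3\,a(d,p)}\, r^2_{cnn(x),i} \right) \right] \\
&= \frac{1}{t \Psi} \left( \frac{d\,2^{d-1} B_2(\phi, C)}{3\,a(d,p)^2}\, E_{\mathcal{D}}[r^{-d+2}_{cnn(x),i}] + \frac{d^2\,2^{2d-2} B_2(\phi, C)^2}{9\,a(d,p)^2}\, E_{\mathcal{D}}[r^4_{cnn(x),i}] \right). \qquad \text{(T2.7)}
\end{aligned}
$$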


From Eq. (7) with $m = 4$ or $d - 2$, the first term in the summation of Eq. (T2.7) always dominates the
magnitude of this expression. If $d = 1$, the variance is $O(\psi^{-1})$ from Eq. (7) with $m = 1$. If $d \ge 2$, it is
$O(\psi^{1-2/d+\epsilon})$ from Eq. (7) with $m = d - 2$. Similarly to the analysis of the square bias, this uses the second
order approximation of $\phi(x)$. However, the result using the higher order approximation is the same, since
the second order term dominates the magnitude. Thus, Eq. (9) is derived. □

Appendix C - LiNearN-Cluster Algorithms


The  procedures  for  LiNearN-Cluster  are  provided  in  the  following  three  algorithms. 


Algorithm 5: CoreRegions(C)

input : C - {H^i | i = 1, ..., t}.
output: {H^i | i = 1, ..., s} containing only core regions.

for i = 1 to t do
    foreach H ∈ H^i do
        density ← H.mass / H.radius;
        if density < MinDensity then
            H^i ← H^i \ {H};
        end
    end
end
return {H^i | i = 1, ..., s} containing only core regions;


Algorithm 6: FindPointsWithinCoreRegions(D, C)

input : D - input data, C : {H^i | i = 1, ..., s} containing only core regions.
output: D containing only points found within core regions.

foreach x ∈ D do
    found ← false;
    for i = 1 to s do
        H ← search(H^i, x) {return a H that covers x};
        if (H is not NULL) then
            found ← true;
            break i loop {found a core region that x falls into};
        end
    end
    if not found then
        D ← D \ {x};
    end
end
return D containing only points found within core regions;

Algorithm 7: LiNearN-Clusters(D, C)

input : D - containing only points found within core regions, C : {H^i | i = 1, ..., s} - containing only
        core regions.
output: {H^i | i = 1, ..., s} with id assigned to each core region.

 1  idNext ← 0;
 2  P ← ∅;
 3  foreach x ∈ D do
 4      found ← false;
 5      {Search for the first assigned core region};
 6      for i = 1 to s do
 7          H ← search(H^i, x) {return a H that covers x};
 8          if (H is not NULL) and isAssigned(H) then
 9              idCurrent ← H.id;
10              found ← true;
11              break the i loop {found a core region that has been assigned to an id};
12          end
13      end
14      if not found then
15          idNext ← idNext + 1;
16          idCurrent ← idNext;
17      end
18      {Set all unassigned regions that x falls into to idCurrent};
19      for i = 1 to s do
20          H ← search(H^i, x) {return a H that covers x};
21          if (H is not NULL) then
22              if isAssigned(H) then
23                  if idCurrent ≠ H.id then
24                      {Current H is already assigned but has a different id; make a note and then merge
                        into one id in step 36};
25                      p.idFirst ← H.id;
26                      p.idSecond ← idCurrent;
27                      P ← P ∪ {p};
28                  end
29              end
30              else
31                  H.id ← idCurrent;
32              end
33          end
34      end
35  end
36  merge(C, P) {merge all 'linked' ids into one id};
37  return {H^i | i = 1, ..., s} with an id assigned to each core region;
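
The three procedures above can be sketched in Python as follows. This is a minimal sketch under stated
assumptions, not the authors' implementation: the Region class, the toy one-dimensional regions and data,
and the use of a union-find structure for the merge step are all ours. In particular, search is assumed to
return the region of the nearest centre when the point lies strictly inside its radius, which the
pseudocode does not spell out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Region:
    centre: Point
    radius: float
    mass: int
    id: Optional[int] = None  # cluster id; None means unassigned


def search(regions: List[Region], x: Point) -> Optional[Region]:
    # Assumed behaviour: the nearest centre's region, if x lies inside it.
    if not regions:
        return None
    nearest = min(regions, key=lambda h: math.dist(h.centre, x))
    return nearest if math.dist(nearest.centre, x) < nearest.radius else None


def core_regions(models, min_density):
    # Algorithm 5: drop regions whose mass / radius is below MinDensity.
    return [[h for h in regions if h.mass / h.radius >= min_density]
            for regions in models]


def points_within_core_regions(data, models):
    # Algorithm 6: keep points covered by a core region in at least one model.
    return [x for x in data
            if any(search(regions, x) is not None for regions in models)]


def merge(models, links):
    # Union-find over linked ids; each linked group takes its smallest id.
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for first, second in links:
        ra, rb = find(first), find(second)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    for regions in models:
        for h in regions:
            if h.id is not None:
                h.id = find(h.id)


def linearn_clusters(data, models):
    # Algorithm 7: propagate ids through regions that share a point.
    id_next, links = 0, []
    for x in data:
        covering = [h for h in (search(regions, x) for regions in models)
                    if h is not None]
        assigned = [h for h in covering if h.id is not None]
        if assigned:
            id_current = assigned[0].id
        else:
            id_next += 1
            id_current = id_next
        for h in covering:
            if h.id is None:
                h.id = id_current
            elif h.id != id_current:
                links.append((h.id, id_current))  # note the link, merge later
    merge(models, links)
    return models


# Toy input: two models (t = 2) with hand-made one-dimensional regions.
models = [
    [Region((0.0,), 1.0, 10), Region((5.0,), 1.0, 10), Region((10.0,), 1.0, 1)],
    [Region((1.2,), 1.0, 8), Region((5.5,), 1.0, 9)],
]
data = [(-0.5,), (1.9,), (0.6,), (5.2,), (10.0,), (3.0,)]

core = core_regions(models, min_density=6)
kept = points_within_core_regions(data, core)
clustered = linearn_clusters(kept, core)
cluster_ids = [[h.id for h in regions] for regions in clustered]
print(kept)
print(cluster_ids)
```

In the toy input, the point 0.6 falls into two regions that were given different ids by earlier points, so
the link is recorded and the two ids are merged at the end, mirroring steps 23 to 27 and step 36 of
Algorithm 7. Unlike the pseudocode, the sketch returns new lists instead of removing elements in place.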


Appendix  D  -  Data  characteristic 

The characteristic of the data set, Ring-Curve-Wave-TriGaussian, used in Section 5.2 is shown in
Figure 11.

(b) Wave  (c) Tri-Gaussian

Figure 11: Scatter plot of the clusters in the Ring-Curve-Wave-TriGaussian data set, as used in [33].

Mass-based Similarity Measure:

An  Effective  Alternative  to  Distance-based  Similarity  Measures 


Submitted  for  Blind  Review 


Abstract — This  paper  introduces  a  unique  similarity  measure 
that does not compute distance. Instead, it is based on the
cardinality of the smallest local region covering the two
instances for which the similarity is measured. Theoretical
analysis reveals that it is a generalisation of mass estimation.
To show its utility, a new information retrieval system is
created based on the new similarity measure. Empirical
evaluations demonstrate that the new system has a significantly
better retrieval performance than five state-of-the-art systems
in image and music retrievals.

Keywords-theory; data mining; unsupervised learning;

I.  Introduction  and  motivation 

Data mining algorithms have traditionally relied on
similarity measures to gauge the similarity between two
instances, as the core operation to solve various data mining
problems. For example, anomaly detection requires ranking
of instances in a database according to their degrees of
anomaly; an information retrieval task ranks instances in a
database which are most similar to a query. These ranking
tasks are traditionally accomplished by computing the
similarity or distance between two instances as the key step to
calculate the ranking. Because the ranking is with respect
to the entire database, the similarity must be computed for
all pairs of instances in the database for anomaly detection,
and between the query and every instance in the database
for information retrieval.

This paper is motivated by a recent content-based
multimedia information retrieval (CBMIR) system called
ReFeat [1]. ReFeat is unique in two aspects. First, it
uses a similarity measure which is primarily based on data
distribution in the local region. In contrast, commonly used
distance measures are solely based on the positions of
instances in the feature space. Second, at the heart of ReFeat
is an anomaly detector which provides a ranking score f(x)
for an instance x, independently of other instances. This
is fundamentally different from most ranking measures that
rely on a distance measure to compute the distance of an
instance relative to another instance, dist(x, y) (e.g., ORCA
[2], Qsim [3]).

The use of such a unique similarity measure is the key reason
why ReFeat has produced better retrieval performance than
state-of-the-art CBMIR systems including manifold learning
method MRBIR [4], Bayesian learning method BALAS [5],
and query-sensitive ranking methods InstRank [6] and Qsim [3].

Despite its unique approach and demonstrably excellent
retrieval performance, the ReFeat paper [1] does not
provide a satisfactory explanation as to why a unary score
function could produce an appropriate ranking of database
instances for a query which requires a binary function. More
to the point: ReFeat does not guarantee that two 'similar'
instances, having a similar ranking score f(·), are in the same
local neighbourhood.

This paper investigates the source of the power of
ReFeat. From a foundation in mass estimation [7], we
derive a new mass-based similarity measure that enables a
new CBMIR system to significantly improve the retrieval
performance of ReFeat.

The contributions of this paper are:

1)  Introducing a unique similarity measure, Massim, and
establishing its theoretical foundation based on mass.

2)  Creating a new CBMIR system called MassIR based
on this similarity measure.

3)  Empirically evaluating MassIR in comparison with
ReFeat and systems which employ commonly used
similarity measures, and showing its superiority in
image and music information retrievals.

Massim has the following characteristics:

•  Unlike the similarity measure used in ReFeat,
Massim guarantees that two similar instances are in
the same local neighbourhood.

•  Unlike distance-based similarity measures, it does not
compute distance and is primarily based on data
distribution in the local region.

•  It is a generalisation of mass estimation. Under certain
conditions, it reduces to mass estimation [7].

The  rest  of  the  paper  is  organised  as  follows.  Sections  II 
and  III  describe  the  related  work  and  ReFeat,  respectively. 
We  introduce  the  intuition,  the  new  similarity  measures  and 
the  implementation  in  the  next  three  sections.  The  new 
information  retrieval  system  and  the  experimental  results  are 
described  in  Section  VII.  Discussion  and  the  conclusions  are 
provided  in  the  last  two  sections. 

II.  RELATED  WORK 

Similarity is a concept that is used extensively not only
in data mining, but also in many other fields such as
psychology [8] and ecology [9]. There are different types
of similarity measures applied for different tasks. We refer
the reader to [10] and [11] for a rich collection of similarity
measures.


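Table I: Unary ranking function for similarity measurement

Function
  Ranking measure: $f(\mathbf{x})$
  Similarity measure: $Similarity(\mathbf{x}, \mathbf{q})$
Model purpose
  Ranking measure: Describe the data profile for each $\mathbf{x}$.
  Similarity measure: Describe the similarity for any $\mathbf{x}$ and $\mathbf{q}$.
Model: iTree
  Ranking measure: $\mathbf{x}$ having a long path length from an iTree is relevant to the data profile; $\mathbf{x}$ having a short path length is irrelevant.
  Similarity measure: ReFeat employs $t$ iTrees which map $\mathbb{R}^d$ to $\mathbb{R}^t$, where each iTree profiles one aspect of the database as a relevance feature in the new feature space:
  $similarity(\mathbf{x}, \mathbf{q}) = score(\mathbf{x}|\mathbf{q}) = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{\ell_i(\mathbf{q})}{c} - 1\right)\ell_i(\mathbf{x})$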

Distance-based similarity measures are often tested against the four distance axioms to determine whether they are metrics. The suitability of similarity measures that adhere to all four axioms has been challenged by various researchers [12], [13]. Tversky [13] and Krumhansl [14] have questioned the validity of using geometrical models as the measure of similarity. They introduced set-theoretic and density-augmented geometrical alternatives, respectively.

Many nonmetric similarity measures [15] have been designed to explicitly violate the triangle inequality axiom.
While  this  enhances  similarity  modeling  (to  model  human 
perception),  systems  that  use  these  nonmetric  measures  are 
inefficient  because  they  cannot  use  many  indexing  schemes 
(that  rely  on  the  triangle  inequality)  to  prune  the  search 
space.  To  improve  efficiency,  research  has  focused  on  ways 
to  (i)  enforce  or  approximate  the  triangle  inequality  while 
preserving  the  desired  similarity  orderings  [9],  [16],  [17]; 
and  (ii)  condense  the  given  data  set  into  a  smaller  set  of 
representative  instances,  and  then  employ  a  classification 
method  to  find  the  representative  instance  which  is  most 
similar  to  the  query  [18],  [19]. 

Irrespective  of  whether  metric  or  nonmetric,  all  existing 
similarity  measures  are  binary  functions. 

iForest [20] employs a unary function to score each instance and was designed specifically for anomaly detection. ReFeat [1], mentioned in the introduction, adopted iForest to solve information retrieval problems. ReFeat has created a binary function to measure the similarity between a query q and a database instance x. However, the function is not strictly a similarity measure because the relevance of q and x is measured against a reference model (i.e., iForest) independently, and the model does not guarantee that q and x are in the same local neighbourhood even though both have similar scores.

In  this  paper,  we  establish  a  principled  approach  which 
ensures  that  two  instances  are  similar  if  they  are  in  the  same 
local  neighbourhood. 

The  new  similarity  measure,  Massim,  is  fundamentally 
different  from  existing  distance  based  similarity  measures 
because  it  is  based  on  mass  [7]  rather  than  distance.  Mass 
estimation  [7]  was  recently  introduced  as  an  alternative  to 
density  estimation  to  solve  data  mining  problems.  Mass 
models  the  centrality  of  a  data  cloud  whereas  density  models 
the  compactness.  Mass  estimation  provides  the  theoretical 
foundation  for  iForest;  and  like  the  measure  used  in  iForest, 
mass  is  a  unary  function.  Here  we  use  mass  to  define  a 
binary  function  to  measure  similarity. 


III. ReFeat: a unary function for similarity measurement

ReFeat [1] maps the original database with $d$ features to a database with $t$ new relevance features, and computes the similarity in the new space. The value of each relevance feature is derived from an iTree, i.e., $\ell_i(\mathbf{x})$, the path length of $\mathbf{x}$ traversing iTree $i$. The similarity score of a database instance $\mathbf{x}$ with respect to a query $\mathbf{q}$ is expressed as the weighted average of the $t$ relevance feature values, where the weight for feature $i$ is $w_i(\mathbf{q}) = \frac{\ell_i(\mathbf{q})}{c} - 1$, and $c$ is a normalisation constant. The similarity function taking a query $\mathbf{q}$ and a database instance $\mathbf{x}$ produces a high (low) score if both $\mathbf{x}$ and $\mathbf{q}$ have long (short) path lengths on many relevance features. Table I provides a brief summary of how ReFeat employs a unary ranking measure to measure similarity between two instances.

ReFeat relies on iTrees being unbalanced. Having distinct long and short path lengths in each iTree is a prerequisite for identifying similar instances. Balanced iTrees always produce the same path length; thus they have no way to differentiate instances of different feature characteristics. Zhou et al. [1] show empirically that iTrees are more likely to be unbalanced than balanced.

However, the analysis does not consider the fact that, even if both have long path lengths, there is no guarantee that x and q will fall into the same local neighbourhood in the feature space, as $\ell(\mathbf{x})$ and $\ell(\mathbf{q})$ are assessed independently, albeit against the same reference model, i.e., iTree.

ReFeat  works  because  it  has  employed  a  large  number 
of  iTrees  (where  t  =  1000  in  their  experiments)  such  that 
similar  instances  are  likely  to  have  long  path  lengths  in 
the  same  local  neighbourhood  of  many  iTrees;  though  some 
iTrees  giving  long  path  lengths  may  not  have  both  instances 
in  the  same  local  neighbourhood. 

In  a  nutshell,  ReFeat  derives  its  retrieval  power  from 
a  ranking  model  that  profiles  the  data  distribution  and  it 
scores  an  instance’s  relevancy  according  to  the  profile.  But 
its  ranking  score  is  based  on  a  unary  function  which  does 
not  guarantee  that  similar  instances  are  in  the  same  local 
neighbourhood.  We  show  in  this  paper  that  overcoming  this 
weakness  significantly  improves  the  retrieval  performance  of 
ReFeat. 

It is important to note that iTree is just one realisation of mass estimation [7], implemented using a tree structure such that path length is a proxy to mass¹. In general, a data distribution can be expressed as a mass distribution, instead of a density distribution, and the mass distribution can be used as a basic data modelling tool to solve data mining problems (see [7], [21]). The data profile produced by iTrees has a one-to-one correspondence to mass: high (low) path length implies high (low) mass in the data distribution.

¹ The path length is computed as $L + f(m)$, where $L$ is the path length traversed from the root to an external node, and $f(m)$ is a function which estimates the average path length of an unexpanded subtree for a training subset of size $m$.

The new information retrieval system we present, MassIR, differs from ReFeat in that

• MassIR measures similarity using a binary function whereas ReFeat is based on a unary function.

• MassIR guarantees similar instances to be in the same local neighbourhood.

• The implementation of Massim uses balanced trees whereas ReFeat relies on unbalanced trees to work.

• MassIR and ReFeat behave differently (see Section VII-B).

IV. Intuition

We will use the simplest form of mass, which is the cardinality of a region [7], to provide the intuition in this section.

Let $r_i$ be a convex local region, and $|r_i|$ be the cardinality of region $r_i$, i.e., the number of instances from the database that the region contains.

The mass inequality holds as follows: $|r_j| < |r_i|$ for $r_j \subset r_i$ if the regions $r_i$, $i \in \{1, \ldots, m\}$, created by a model satisfy the conditions: $\forall i, j;\ i \neq j$, $r_j \neq \emptyset$ and $r_i \setminus r_j \neq \emptyset$.

With the assurance of the mass inequality, the similarity of instances $x, y, z$ can be inferred as shown in the following example: if $r_1 = \{x, y, z\}$ and $r_2 = \{x, y\}$ then $|R(x, y)| < |R(x, z)|$, where $R(a, b)$ is the smallest region covering both $a$ and $b$; $r_1$ and $r_2$ are the regions created by a model. This means that if $x$ is in a local region of few instances with $y$, and $x$ is in a local region of many instances with $z$, then $x$ is more similar to $y$ than to $z$.

We propose to use the cardinality of the smallest convex region covering both $a$ and $b$ as the base function to measure similarity between any two instances. Table II highlights the differences between this mass-based similarity measure and existing distance-based similarity measures.

Based on the simplest form of mass, more sophisticated forms of mass are derived in the next section.

Table II: Mass-based similarity measure versus distance-based similarity measure

Computation
  Mass-based: $Mass(\mathbf{x}, \mathbf{y})$ is primarily based on the data distribution in the local region of the feature space.
  Distance-based: $dist(\mathbf{x}, \mathbf{y})$ is solely based on the positions of $\mathbf{x}$ and $\mathbf{y}$ in the feature space.
Definition
  Mass-based: The mass base function $M(\mathbf{x}, \mathbf{y})$ measures the cardinality of the smallest local region covering both $\mathbf{x}$ and $\mathbf{y}$.
  Distance-based: $dist(\mathbf{x}, \mathbf{y})$ measures the length of the shortest path from $\mathbf{x}$ to $\mathbf{y}$.
Inequality
  Mass-based: $similarity(\mathbf{x}, \mathbf{y}) > similarity(\mathbf{x}, \mathbf{z}) \equiv Mass(\mathbf{x}, \mathbf{y}) < Mass(\mathbf{x}, \mathbf{z})$
  Distance-based: $similarity(\mathbf{x}, \mathbf{y}) > similarity(\mathbf{x}, \mathbf{z}) \equiv dist(\mathbf{x}, \mathbf{y}) < dist(\mathbf{x}, \mathbf{z})$
Metric
  Mass-based: The measure does not satisfy some of the distance axioms.
  Distance-based: All distance axioms usually hold.

V. Mass-based similarity measures

The  theoretical  basis  of  mass-based  similarity,  Massim, 
is  given  here.  We  introduce  a  new  one-dimensional  similarity 
measure  in  section  V-A,  provide  its  axioms  in  section  V-B, 
and  describe  the  multi-dimensional  version  in  section  V-C. 

A.  One-dimensional  similarity  measure 

Definition 1: $R(x, y|E)$ is the smallest local region covering $x$ and $y$ in a given model $E$, which partitions the feature space into convex local regions, where $x, y \in \mathbb{R}$:

$$R(x, y|E) = r^* \text{ such that } |r^*| = \min_{r \in E} |r| \text{ and } x, y \in r.$$

$E$ must be chosen to ensure that $R(x, y|E)$ is a unique region that covers both $x$ and $y$. Let $E(s_i) = \{r_1, r_2, r_3\}$ be a model resulting from a binary split $s_i$ in the real line defined by $D = \{x_1, x_2, \ldots, x_n\}$; let $s_i$ be a binary split between $x_i$ and $x_{i+1}$ which divides the real line into two non-overlapping local regions, $r_1$ (where $x < s_i$) and $r_2$ (where $x > s_i$); and $r_3$ covers the entire real line.

The mass base function $M_i(x, y)$ denotes the mass of the local region $R(x, y|E(s_i))$.

Definition 2: The mass-based similarity measure $Mass(x, y)$ is defined using $D$ as a weighted power mean of a series of $M_i(x, y)$ weighted by $p(s_i)$ over $n - 1$ splits as follows:

$$Mass(x, y) = \left(\sum_{i=1}^{n-1} M_i(x, y)^e\, p(s_i)\right)^{\frac{1}{e}} \quad (1)$$

where $p(s_i)$ is the probability of selecting the binary split $s_i$ and $\sum_i p(s_i) = 1$; and

$$M_i(x, y) = \begin{cases} i & \text{if } R(x, y|E(s_i)) = r_1 \\ n - i & \text{if } R(x, y|E(s_i)) = r_2 \\ n & \text{if } R(x, y|E(s_i)) = r_3 \end{cases}$$

For an example of five points, $x_1 < x_2 < \ldots < x_5$, and $e = 1$, $Mass(x_1, \cdot)$ for $x_1$ and each of the five points is given as follows:

$$\begin{aligned}
Mass(x_1, x_1) &= 1p(s_1) + 2p(s_2) + 3p(s_3) + 4p(s_4) \\
Mass(x_1, x_2) &= 5p(s_1) + 2p(s_2) + 3p(s_3) + 4p(s_4) \\
Mass(x_1, x_3) &= 5p(s_1) + 5p(s_2) + 3p(s_3) + 4p(s_4) \\
Mass(x_1, x_4) &= 5p(s_1) + 5p(s_2) + 5p(s_3) + 4p(s_4) \\
Mass(x_1, x_5) &= 5p(s_1) + 5p(s_2) + 5p(s_3) + 5p(s_4)
\end{aligned}$$

The components of the summation, $M_i(x_1, \cdot)$ due to each split $s_i$, are illustrated in Figure 1.
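The five-point example can be reproduced numerically. The sketch below is a minimal illustration of Definition 2, not the authors' implementation: it assumes a uniform split probability $p(s_i) = 1/(n-1)$ (the text leaves $p(s_i)$ unspecified here) and places each split at the midpoint between neighbouring points; the function names `base_mass` and `mass_1d` are hypothetical.

```python
def base_mass(points, i, x, y):
    """M_i(x, y): mass of the smallest region of E(s_i) covering x and y.

    `points` is sorted; split s_i (1-based) lies between points[i-1] and points[i].
    """
    n = len(points)
    s = (points[i - 1] + points[i]) / 2.0  # assumption: split at the midpoint
    if x < s and y < s:
        return i        # both fall in r_1
    if x > s and y > s:
        return n - i    # both fall in r_2
    return n            # only r_3 (the whole real line) covers both


def mass_1d(points, x, y, e=1.0, p=None):
    """Equation (1): weighted power mean of M_i(x, y) over the n-1 splits."""
    points = sorted(points)
    n = len(points)
    if p is None:
        p = [1.0 / (n - 1)] * (n - 1)  # assumption: uniform p(s_i)
    total = sum(base_mass(points, i, x, y) ** e * p[i - 1] for i in range(1, n))
    return total ** (1.0 / e)


pts = [1, 2, 3, 4, 5]
print([mass_1d(pts, 1, y) for y in pts])  # [2.5, 3.5, 4.25, 4.75, 5.0]
```

With uniform weights, $Mass(x_1, \cdot)$ grows from 2.5 for $x_1$ itself to 5 for $x_5$, matching the coefficients listed above.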


Table III: Axioms used for mass-based similarity and distance-based similarity

Axiom 1
  Mass-based: $Mass(x, y) \ge 1$
  Distance-based: $dist(x, y) \ge 0$ (non-negativity)
Axiom 2
  Mass-based: (i) $\forall x, y:\ Mass(x, x) \le Mass(x, y)$; (ii) $\exists x \ne y:\ Mass(x, x) \ne Mass(y, y)$
  Distance-based: $dist(x, y) = 0 \iff x = y$ (identity of indiscernibles)
Axiom 3
  Mass-based: $Mass(x, y) = Mass(y, x)$
  Distance-based: $dist(x, y) = dist(y, x)$ (symmetry)
Axiom 4
  Mass-based: $Mass(x, z) < Mass(x, y) + Mass(y, z)$
  Distance-based: $dist(x, z) \le dist(x, y) + dist(y, z)$ (triangle inequality)

Definition 3: Level-$h$ $Mass^h(x, y)$ defined using $D$, where $1 < h < n$, is expressed as:

$$Mass^h(x, y) = \left(\sum_{i=1}^{n-1} Mass_i^{h-1}(x, y)^e\, p(s_i)\right)^{\frac{1}{e}} \quad (2)$$

where $Mass_i^{h-1}(x, y)$ is $Mass^{h-1}(x, y)$ defined using $D_i \subset D$, i.e., the data subset covered by $R(x, y|E(s_i))$.

Figure 1: Example of the mass base function $M_i(x_1, \cdot)$ due to each of the four binary splits $s_1, s_2, s_3, s_4$. The values of $M_i(x_1, x_a)$ for $a = 1, \ldots, 5$ are: (a) due to $s_1$: 1, 5, 5, 5, 5; (b) due to $s_2$: 2, 2, 5, 5, 5; (c) due to $s_3$: 3, 3, 3, 5, 5; (d) due to $s_4$: 4, 4, 4, 4, 5.

A multi-modal data distribution requires level-$h$ $Mass^h(x, y)$, which takes into account the nature of the data distribution more effectively.

Figure 2: Distributions of (a) $Mass^h(x, y = 0.9)$ and (b) $Mass^h(x, x)$ in a data set having two Gaussian density distributions, where $\mu = \{0.1, 0.9\}$ and $\sigma = 0.1$. Distributions for $h = 1, 2, 3$ are shown, where $e = -1$ is used.

Note that the distribution of $Mass^h(x, y)$ (for $e = 1$) reduces to the mass distribution $mass^h(x)$ as defined by [7] when $y = x$:

$$mass^h(x) = \begin{cases} \sum_{i=1}^{n-1} m_i(x)\, p(s_i), & h = 1 \\ \sum_{i=1}^{n-1} mass_i^{h-1}(x)\, p(s_i), & h > 1 \end{cases}$$

where

$$m_i(x) = \begin{cases} i & \text{if } x < s_i \\ n - i & \text{if } x > s_i \end{cases}$$

and $mass_i^{h-1}(x)$ is $mass^{h-1}(x)$ defined using $D_i \subset D$, i.e., the data subset separated by $s_i$ that includes $x$.
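For $h = 1$ the reduction can be checked directly from Definitions 1 and 2. The short derivation below is a sketch that assumes $x$ does not coincide with any split point $s_i$:

```latex
% With y = x, the smallest region of E(s_i) covering x is r_1 or r_2, never r_3.
M_i(x, x) =
\begin{cases}
  i     & \text{if } x < s_i \\
  n - i & \text{if } x > s_i
\end{cases}
\;=\; m_i(x),
\qquad \text{so with } e = 1: \quad
Mass(x, x) = \sum_{i=1}^{n-1} M_i(x, x)\, p(s_i)
           = \sum_{i=1}^{n-1} m_i(x)\, p(s_i)
           = mass^{1}(x).
```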


B.  Axioms  for  Massim 

Function $Mass: \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ satisfies the four axioms shown in the second column of Table III.

A comparison with the axioms for distance-based similarity is also provided in Table III. The key difference is in Axiom 2: (i) the minimum of $Mass(x, y)$ is $Mass(x, x)$; and (ii) $Mass(x, x)$ is not the same for all $x$. Figures 2(a) and 2(b) show the distributions of $Mass(x, y)$ (for a given $y$) and $Mass(x, x)$, respectively.

The proofs of the axioms are omitted because of space limitations.
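Although the proofs are omitted, the axioms can be spot-checked numerically on a small one-dimensional data set. The sketch below is an illustrative check, not a proof: it assumes midpoint splits and randomly drawn split probabilities, checks Axioms 1 to 3 for $e = 1$ and $e = -1$, and checks Axiom 4 only for $e = 1$; the names `mass_1d` and `check_axioms` are hypothetical.

```python
import itertools
import random


def mass_1d(points, p, x, y, e=1.0):
    """Equation (1) on sorted 1-D points; p[i-1] is the probability of split s_i."""
    n = len(points)
    total = 0.0
    for i in range(1, n):
        s = (points[i - 1] + points[i]) / 2.0  # assumption: split at the midpoint
        if x < s and y < s:
            m = i        # region r_1
        elif x > s and y > s:
            m = n - i    # region r_2
        else:
            m = n        # region r_3
        total += (m ** e) * p[i - 1]
    return total ** (1.0 / e)


random.seed(0)
points = sorted(random.sample(range(100), 8))
weights = [random.random() for _ in range(len(points) - 1)]
p = [w / sum(weights) for w in weights]  # arbitrary probabilities summing to 1


def check_axioms(e, check_triangle):
    eps = 1e-9
    for x, y, z in itertools.product(points, repeat=3):
        m_xy = mass_1d(points, p, x, y, e)
        assert m_xy >= 1 - eps                                  # Axiom 1
        assert mass_1d(points, p, x, x, e) <= m_xy + eps        # Axiom 2(i)
        assert abs(m_xy - mass_1d(points, p, y, x, e)) < eps    # Axiom 3
        if check_triangle:                                      # Axiom 4
            assert mass_1d(points, p, x, z, e) < m_xy + mass_1d(points, p, y, z, e)
    self_mass = {round(mass_1d(points, p, x, x, e), 9) for x in points}
    assert len(self_mass) > 1                                   # Axiom 2(ii)
    return True


print(check_axioms(1.0, True), check_axioms(-1.0, False))
```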


C.  Multi-dimensional  similarity  measure 

The  one-dimensional  Massim  can  be  generalised  to 
multi-dimensional  by  using  a  model  E  that  partitions  the 
feature  space  into  local  regions,  instead  of  the  binary  splits. 

We use the same approach as proposed by [7] to eliminate the need to compute the probability of a binary split, $p(s_i)$; this produces a randomised approximation.

The idea is to generate multiple random local regions that satisfy the mass inequality, and the mass similarity of any two instances is estimated by averaging the mass of all smallest local regions which cover both instances.

We show that random regions can be generated using axis-parallel splits called half-space splits. Each half-space split is performed on a randomly selected attribute in a multi-dimensional feature space. For an $h$-level split, a tree structure is formed in which each path has $h$ half-space splits, and the tree has a total of $2^h$ non-overlapping local regions.

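Let $E(h)$ be a multi-dimensional model producing $2^h$ local regions as a result of an $h$-level split. Let $R(\mathbf{x}, \mathbf{y}|E(h))$ be the smallest region of $E(h)$ covering both $\mathbf{x}$ and $\mathbf{y}$, where $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$; and let $\mathfrak{M}_h(\mathbf{x}, \mathbf{y})$ be the mass in $R(\mathbf{x}, \mathbf{y}|E(h))$.

In addition, multiple models are required, which give rise to $\mathfrak{M}_{h,i}(\mathbf{x}, \mathbf{y})$, the mass in $R(\mathbf{x}, \mathbf{y}|E_i(h))$.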

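In one-dimensional problems, Equations (1) and (2) can now be approximated as follows:

$$\left(\sum_{i=1}^{n-1} M_i(x, y)^e\, p(s_i)\right)^{\frac{1}{e}} \approx \left(\frac{1}{t}\sum_{i=1}^{t} \mathfrak{M}_{1,i}(x, y)^e\right)^{\frac{1}{e}} \quad (3)$$

$$Mass^h(x, y) \approx \left(\frac{1}{t}\sum_{i=1}^{t} \mathfrak{M}_{h,i}(x, y)^e\right)^{\frac{1}{e}} \quad (4)$$

where $t > 1$ is the number of random regions used to define the mass similarity of $x$ and $y$.

Since $E(h)$ is defined in multi-dimensional space, the multi-dimensional mass similarity is the same as Equation (4), simply replacing $x$ and $y$ with $\mathbf{x}$ and $\mathbf{y}$:

$$Mass^h(\mathbf{x}, \mathbf{y}) \approx \left(\frac{1}{t}\sum_{i=1}^{t} \mathfrak{M}_{h,i}(\mathbf{x}, \mathbf{y})^e\right)^{\frac{1}{e}} \quad (5)$$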

A half-space tree implementing $\mathfrak{M}_h(\cdot, \cdot)$ that generates regions $r_i$, $\exists i, j:\ r_j \subset r_i$, must satisfy the conditions specified in Section IV: $\forall i, j;\ i \neq j$, $r_j \neq \emptyset$ and $r_i \setminus r_j \neq \emptyset$, to ensure the mass inequality.

This is realised in a similarity tree (sTree), where a data subset $\mathcal{D} \subset D$ is partitioned into two equal-size subsets recursively until it cannot be subdivided. Each sTree is balanced, having height $h = \log_2(\psi)$ and $\psi$ regions, where $\psi = |\mathcal{D}|$. $D$ is then populated to the $\psi$ regions to estimate the mass in each region. The details of the implementation are provided in Section VI.

To simplify notation, we will drop $h$ hereafter and denote $Mass^h(\cdot, \cdot)$ as $Mass(\cdot, \cdot)$ when the context is clear.

VI.  Implementation 

Mass-based  similarity  estimation  is  implemented  in  two 
stages.  In  the  modelling  stage,  a  Similarity  Forest  (sForest) 
is  generated  from  D.  The  resultant  sForest  is  the  similarity 
model  of  the  given  dataset.  Mass-based  similarity  between 
two  instances  is  calculated  in  the  estimation  stage. 

The  sForest  generation  process  is  provided  in  Algorithm 
1.  A  sForest  consists  of  t  Similarity  Trees  (sTrees). 

Algorithm 1: sForest(D, t, ψ)

Input: D - database, t - number of trees, ψ - subsampling size
Output: sForest
Initialize sForest
for i = 1 → t do
    𝒟 ← select ψ instances from D without replacement
    T ← sTree(𝒟)
    UpdateTreeMass(T, D)
    sForest ← sForest ∪ T
end for
return sForest


Algorithm 2: sTree(𝒟)

Input: 𝒟 - input data
Output: sTree
if |𝒟| is 1 then
    return exNode
end if
Let A be the complete list of attributes
a ← randomly selected attribute from A
𝒟_sorted ← sort 𝒟 on a
v_l ← value of a of the (|𝒟|/2)th item in 𝒟_sorted
v_r ← value of a of the (|𝒟|/2 + 1)th item in 𝒟_sorted
v ← randomly selected value between v_l and v_r
𝒟_l ← filter(𝒟, a < v)
𝒟_r ← filter(𝒟, a ≥ v)
return inNode{LeftChild ← sTree(𝒟_l), RightChild ← sTree(𝒟_r), SplitAttribute ← a, SplitValue ← v}


Each sTree is built independently from $\mathcal{D}$, a subset randomly selected without replacement from $D$, where $|\mathcal{D}| = \psi$. A sTree node can either be an external node or an internal node with exactly two child subtrees. An internal node consists of a randomly selected attribute $a$ and a split value $v$, such that the test $a < v$ divides the training set in this node into two equal-size subsets. The process is repeated in each subset recursively until $|\mathcal{D}| = 1$, as described in Algorithm 2.
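The modelling and estimation steps of this section (Algorithms 1 to 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' MATLAB implementation: instances are tuples of floats, the sample is halved by position after sorting (so ties cannot unbalance the tree), at least one of the two compared instances is assumed to be in the database so that every returned mass is at least 1, and all names (`Node`, `build_stree`, `update_mass`, `tree_mass`, `build_sforest`, `mass`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence

Point = Sequence[float]


@dataclass
class Node:
    mass: int = 0
    attr: Optional[int] = None  # None marks an external node
    value: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_stree(sample: List[Point], rng: random.Random) -> Node:
    """Algorithm 2: halve the sample recursively on a randomly chosen attribute."""
    if len(sample) <= 1:
        return Node()
    a = rng.randrange(len(sample[0]))
    ordered = sorted(sample, key=lambda pt: pt[a])
    half = len(ordered) // 2
    v = rng.uniform(ordered[half - 1][a], ordered[half][a])
    return Node(attr=a, value=v,
                left=build_stree(ordered[:half], rng),
                right=build_stree(ordered[half:], rng))


def update_mass(node: Node, data: List[Point]) -> None:
    """Algorithm 3: store how many database instances reach each node."""
    node.mass = len(data)
    if node.attr is not None:
        update_mass(node.left, [pt for pt in data if pt[node.attr] < node.value])
        update_mass(node.right, [pt for pt in data if pt[node.attr] >= node.value])


def tree_mass(node: Node, x: Point, y: Point) -> int:
    """Algorithm 5: mass of the smallest region of one tree covering x and y."""
    while node.attr is not None:
        x_left = x[node.attr] < node.value
        y_left = y[node.attr] < node.value
        if x_left != y_left:
            break  # x and y part ways here, so this node is the smallest region
        node = node.left if x_left else node.right
    return node.mass


def build_sforest(data: List[Point], t: int, psi: int, seed: int = 0) -> List[Node]:
    """Algorithm 1: t trees, each built from psi instances sampled without replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(t):
        tree = build_stree(rng.sample(data, psi), rng)
        update_mass(tree, data)
        forest.append(tree)
    return forest


def mass(forest: List[Node], x: Point, y: Point, e: float = -1.0) -> float:
    """Algorithm 4: power mean of the per-tree masses with exponent e."""
    return (sum(tree_mass(tr, x, y) ** e for tr in forest) / len(forest)) ** (1.0 / e)
```

A smaller `mass(forest, x, y)` indicates that `x` and `y` share small regions in many trees, i.e., that they are more similar.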
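Algorithm 3: UpdateTreeMass(T, D)

Input: T - sTree, D - data
Output: T (sTree), having all nodes updated with 𝔐
T.mass ← |D|
if T is an internal node then
    D_l ← filter(D, T.SplitAttribute < T.SplitValue)
    D_r ← filter(D, T.SplitAttribute ≥ T.SplitValue)
    UpdateTreeMass(T.LeftChild, D_l)
    UpdateTreeMass(T.RightChild, D_r)
end if


Table IV: Measures and feedback formulations used in ReFeat, MassIR and $L^p$-IR. Note that $\mathbf{x} = \langle x^{(1)}, x^{(2)}, \ldots, x^{(d)} \rangle$.

Score
  ReFeat: $Score(\mathbf{x}|\mathbf{q}) = \frac{1}{t}\sum_{i=1}^{t}\left(w_i(\mathbf{q})\,\ell_i(\mathbf{x})\right)$
  MassIR: $Mass(\mathbf{x}, \mathbf{q}) = \left(\frac{1}{t}\sum_{i=1}^{t}\mathfrak{M}_i(\mathbf{x}, \mathbf{q})^e\right)^{\frac{1}{e}}$
  $L^p$-IR: $dist(\mathbf{x}, \mathbf{q}) = \left(\sum_{j=1}^{d}|x^{(j)} - q^{(j)}|^p\right)^{\frac{1}{p}}$
Model
  ReFeat: $w_i(\cdot)$ is a function of $\ell_i(\cdot)$, and is derived from iForest.
  MassIR: $\mathfrak{M}_i(\cdot, \cdot)$ is derived from sForest.
  $L^p$-IR: No trained models are required.
Feedback ($Q = \mathcal{P} \cup \mathcal{N}$)
  ReFeat: $w_i(Q) = \frac{1}{|\mathcal{P}|}\sum_{\mathbf{y} \in \mathcal{P}} w_i(\mathbf{y}) - \gamma\frac{1}{|\mathcal{N}|}\sum_{\mathbf{z} \in \mathcal{N}} w_i(\mathbf{z})$
  MassIR: $Mass(\mathbf{x}, Q) = \left[\frac{1}{|\mathcal{P}|}\sum_{\mathbf{y} \in \mathcal{P}} Mass(\mathbf{x}, \mathbf{y})^e - \gamma\frac{1}{|\mathcal{N}|}\sum_{\mathbf{z} \in \mathcal{N}} Mass(\mathbf{x}, \mathbf{z})^e\right]^{\frac{1}{e}}$
  $L^p$-IR: $dist(\mathbf{x}, Q) = \frac{1}{|\mathcal{P}|}\sum_{\mathbf{y} \in \mathcal{P}} dist(\mathbf{x}, \mathbf{y}) - \gamma\frac{1}{|\mathcal{N}|}\sum_{\mathbf{z} \in \mathcal{N}} dist(\mathbf{x}, \mathbf{z})$
Axioms
  ReFeat: Violates all four distance axioms.
  MassIR: Violates distance axiom 2.
  $L^p$-IR: $p \ge 1$: satisfies all distance axioms; $p < 1$: violates the triangle inequality.


After creating a sTree, it is populated with D to determine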

the mass in each node. It is done using Algorithm 3. To estimate $Mass(\mathbf{x}, \mathbf{y})$, $\mathbf{x}$ and $\mathbf{y}$ are parsed through the $t$ sTrees as described in Algorithms 4 and 5.

Note that a sTree is a balanced binary tree. Let $C^i_{\mathbf{x}}$ be the index of the external node in which $\mathbf{x}$ falls in sTree $i$, where $C^i_{\mathbf{x}} \in \{1, 2, \ldots, \psi\}$. Since $C^i_{\mathbf{x}}$ is fixed for each $\mathbf{x} \in D$, it can be computed in a preprocessing step.

For any instance $\mathbf{y}$, $\mathfrak{M}_{h,i}(\mathbf{x}, \mathbf{y})$ in Equation (5) is the mass of the lowest common ancestor node of $C^i_{\mathbf{x}}$ and $C^i_{\mathbf{y}}$ in sTree $i$. This implementation converts the bulk of the tree traversals of $(\mathbf{x}, \mathbf{y})$ into table lookups based on the indices ($C^i_{\mathbf{x}}$, $C^i_{\mathbf{y}}$). It reduces the time complexity for Algorithms 4 and 5 from $O(nt\log(\psi))$ to $O((n + \psi)t)$, where $n = |D|$.

Algorithm 4: Mass(x, y, F, e)

Input: F - sForest with t sTrees, e - exponent, x, y - instances
Output: Mass^h(x, y) as in Equation (4)
Mass ← 0
for i = 1 → t do
    T ← ith sTree in F
    Mass ← Mass + {TreeMass(T, x, y)}^e
end for
Mass ← (Mass / t)^(1/e)
return Mass


Algorithm 5: TreeMass(T, x, y)

Input: T - sTree, x, y - instances
Output: mass of y with respect to x for T
if T is an external node then
    return T.mass
end if
if x(T.SplitAttribute) < T.SplitValue and y(T.SplitAttribute) < T.SplitValue then
    return TreeMass(T.LeftChild, x, y)
else if x(T.SplitAttribute) ≥ T.SplitValue and y(T.SplitAttribute) ≥ T.SplitValue then
    return TreeMass(T.RightChild, x, y)
else
    return T.mass
end if


VII. MassIR


This  section  describes  a  new  information  retrieval  system 
constructed  from  Massim,  called  MassIR.  It  is  motivated 
to  improve  the  retrieval  performance  of  ReFeat. 

Like ReFeat, MassIR has two stages. In the off-line modelling stage, a sForest is built from $D$ as described in Section VI. In the on-line retrieval stage, mass-based similarity is used to rank the instances in $D$ with respect to a query and its feedback instances.

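For a given query $\mathbf{q}$, $Mass(\mathbf{x}, \mathbf{q})$ is estimated for all $\mathbf{x}$ in $D$. The ranking with respect to $\mathbf{q}$ is conducted as follows: $\forall \mathbf{x}_i, \mathbf{x}_j \in D$, if $Mass(\mathbf{x}_i, \mathbf{q}) < Mass(\mathbf{x}_j, \mathbf{q})$ then $similarity(\mathbf{x}_i, \mathbf{q}) > similarity(\mathbf{x}_j, \mathbf{q})$.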

The feedback mechanism in MassIR is given in the second column of Table IV, where $Q = \mathcal{P} \cup \mathcal{N}$; $\mathcal{P}$ denotes the set of positive feedbacks and the query, and $\mathcal{N}$ denotes the set of negative feedbacks. The ranking of instances $\mathbf{x} \in D$ is based on $Mass(\mathbf{x}, Q)$.

MassIR takes $O(t\psi\log(\psi))$ time to generate $t$ sTrees. It takes $O(nt\log(\psi))$ to update the mass values in the tree nodes. Altogether, the off-line processing time is $O(nt\log(\psi))$. The time complexity for a query, already discussed in Section VI, is $O((n + \psi)t)$. Together with $|Q| - 1$ feedbacks, the total computation time is $O((n + |Q|\psi)t)$.
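The $O((n + \psi)t)$ query cost relies on the index-based lookup described in Section VI. The sketch below illustrates one way such a lookup could work for a single balanced tree; it assumes $\psi$ is a power of two and uses 0-based external-node indices numbered left to right (the text uses 1-based indices), and the names `build_mass_table` and `lca_mass` are hypothetical.

```python
from typing import List


def build_mass_table(leaf_counts: List[int]) -> List[List[int]]:
    """levels[k][j] is the mass of the j-th node at height k (height 0 = external nodes)."""
    psi = len(leaf_counts)
    assert psi > 0 and psi & (psi - 1) == 0, "sketch assumes psi is a power of two"
    levels = [list(leaf_counts)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[2 * j] + prev[2 * j + 1] for j in range(len(prev) // 2)])
    return levels


def lca_mass(levels: List[List[int]], cx: int, cy: int) -> int:
    """Mass of the lowest common ancestor of external nodes cx and cy."""
    k = (cx ^ cy).bit_length()  # height at which the two root-to-leaf paths merge
    return levels[k][cx >> k]


# Eight external nodes holding 16 database instances in total.
levels = build_mass_table([3, 1, 2, 2, 5, 1, 1, 1])
print(lca_mass(levels, 0, 1), lca_mass(levels, 1, 2), lca_mass(levels, 3, 4))  # 4 8 16
```

Once the external-node index of every database instance has been precomputed, each pairwise mass becomes a constant-time lookup instead of a tree traversal.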

A. Comparison with ReFeat and $L^p$-IR

In addition to MassIR, we have created a new information retrieval system called $L^p$-IR, which has a feedback mechanism similar to ReFeat but is based on the commonly used distance function, the $L^p$-norm. A comparison of MassIR, ReFeat and $L^p$-IR is provided in Table IV.

The score function of ReFeat is essentially based on the unary function $\ell(\cdot)$. MassIR and $L^p$-IR are based on Massim and the $L^p$-norm, respectively, to compute the similarity of two instances. ReFeat and MassIR must build a model in order to assess the similarity between instances, but $L^p$-IR does not need a model.

Table V: Time complexity comparison

            Off-line               Query                       Feedback
ReFeat      $O(nt\log(\psi))$      $O((n + \log(\psi))t)$      $O((n + |Q|)t)$
MassIR      $O(nt\log(\psi))$      $O((n + \psi)t)$            $O((n + |Q|\psi)t)$
$L^p$-IR    -                      $O(nd)$                     $O(n|Q|d)$

The feedback mechanism has the same form in all three systems; but ReFeat modifies the weight function $w(\mathbf{q})$ only, independent of $\mathbf{x}$. MassIR and $L^p$-IR have exactly the same formulation when $e = 1$, except that different similarity measures are used. ReFeat violates all four distance axioms; MassIR violates some; and $L^p$-IR satisfies all axioms for $p \ge 1$, but violates the triangle inequality axiom for $p < 1$.
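The behaviour of the $L^p$-norm with respect to the triangle inequality can be seen on three points in the plane. The sketch below is a small illustration using the distance formula of Table IV; the function name `lp_dist` and the chosen points are assumptions made for the example.

```python
def lp_dist(x, y, p):
    """L^p distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
for p in (0.5, 1.0, 2.0):
    direct = lp_dist(x, z, p)
    detour = lp_dist(x, y, p) + lp_dist(y, z, p)
    # The triangle inequality requires direct <= detour.
    print(p, direct, detour, direct <= detour)
```

For $p = 0.5$ the direct distance is 4 while the detour through $y$ costs only 2, so the triangle inequality fails; for $p = 1$ and $p = 2$ it holds.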

Table V compares the time complexities of MassIR with those of ReFeat and $L^p$-IR. Except in very small databases, $n > |Q|\psi$. Hence, the query and feedback time complexities of MassIR and ReFeat become $O(nt)$. Similarly, the space complexities of MassIR and ReFeat are both $O(nt)$. The space complexity of $L^p$-IR is $O(n)$. In MassIR and ReFeat, $t$ is a constant parameter. Therefore, the space and time requirements of MassIR and ReFeat increase linearly with the number of instances in the database. However, this is not the case for $L^p$-IR, as its feedback efficiency is strongly affected by $|Q|$ and the number of dimensions, $d$.

B.  Experiments 

The aims of the experiments are to (i) compare MassIR with state-of-the-art systems ReFeat [1], Qsim [3], InstRank [6], MRBIR [4] and BALAS [5]; and (ii) analyse the behaviours of MassIR, ReFeat and $L^p$-IR when key parameter settings are changed.

We employ two databases which were previously used in other studies: the GTZAN music database [1], [22] and the COREL image database [1], [23].

The GTZAN database contains 1000 songs. Each song belongs to one of 10 genres, namely classical, country, disco, hiphop, jazz, rock, blues, reggae, pop and metal. There are 100 songs in each genre. Each song is stored as a 22,050 Hz, 16-bit mono audio file, which is a 30-second excerpt [22]. Each instance has 230 attributes, extracted from the music files, as reported in [1]. The COREL database has 10,000 images. They belong to 100 classes. Each class has 100 images. Each instance has 67 attributes, comprising 32 colour features, 24 texture features and 11 shape features. The feature vectors of the COREL database are elaborated in [23].

Experiments  to  evaluate  the  retrieval  performances  of 
competing  methods  were  designed  as  follows.  A  query  was 
randomly  selected  from  the  database.  A  retrieval  method 
ranked  the  instances  in  the  database  with  respect  to  the 
query. In each feedback round, two positive instances (having the same class as the query) and two negative instances (having classes other than the query class) were provided.
The  method  then  re-ranked  each  instance  in  the  database 
with  respect  to  the  query  and  the  feedbacks.  Five  different 
queries  were  randomly  selected  from  each  class  in  the 
database;  and  up  to  five  feedback  rounds  were  performed 
for  each  query,  where  instances  from  a  feedback  round 
were  added  to  those  from  the  previous  rounds.  Instances 
already  used  as  query  and  feedbacks  were  not  included  in 


the  database  for  retrieval.  This  experimental  design  is  the 
same  as  used  in  [1]. 

The  above  was  repeated  20  times,  each  using  a  different 
forest  from  MassIR  or  ReFeat.  Accordingly,  a  total  of 
1,000  queries  and  5,000  feedback  rounds  were  tested  in 
GTZAN;  and  a  total  of  10,000  queries  and  50,000  feedback 
rounds  were  tested  in  COREL.  The  retrieval  performance 
is  measured  in  terms  of  Mean  Average  Precision  (MAP). 
A  two-standard  error  significance  test  is  used  to  examine 
whether  the  performance  difference  in  a  comparison  is 
significant. 
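The MAP measure can be computed from the ranked lists as sketched below. The text does not spell out the exact variant used, so this sketch assumes the standard definition of average precision over a full ranking of the database; the function names and the boolean-list input format are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranking.

    ranked_relevance[k] is True when the instance at rank k+1 has the query's class.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


print(mean_average_precision([[True, False, True], [False, True]]))
```

In this example the first ranking scores (1/1 + 2/3)/2 and the second scores 1/2, so the mean is about 0.667.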

MassIR and ReFeat were implemented using MATLAB. The experiments were conducted on Sun Grid Engine (SGE). 4GB and 8GB of memory were allocated for information retrieval tasks in GTZAN and COREL, respectively.

The results are presented in the following three paragraphs. Figure 3 shows the comparison of MassIR with ReFeat, Qsim, InstRank, MRBIR and BALAS. Results of Qsim, InstRank, MRBIR and BALAS were taken from [1]. The best parameter settings reported by [1] were used for ReFeat (i.e., $\psi = 4$ for GTZAN, $\psi = 8$ for COREL, and $\gamma = 0.25$). MassIR uses $\psi = 256$, $\gamma = 0$ and $e = -1$. The results show that MassIR performs significantly better than all other existing methods. Also note that the performance gap increases as the number of feedback rounds increases, indicating that MassIR utilizes feedback instances more effectively than the other methods.

Figure 4 shows the behaviour of MassIR and ReFeat when the sample size, $\psi$, changes. ReFeat performs best using a small sample size. In contrast, the performance of MassIR monotonically increases as $\psi$ increases.

Figure 5 shows the retrieval performance of MassIR and Lp-IR
as γ varies. MassIR was tested for e = -1 and 1. Lp-IR was
tested for p = 0.25, 0.5, 0.75, 1 and 2. Recall that the
Lp-norm is a nonmetric that violates the triangle inequality
axiom when p < 1. The results show that MassIR with e = -1
was better than both MassIR with e = 1 and Lp-IR for all γ.
MassIR performed best with γ = 0 in both databases. However,
this is not the case for Lp-IR: depending on the p value and
the database, its performance peaked at either γ = 0.5 or
γ = 0.75. In addition, MassIR is less sensitive to γ than
Lp-IR. It is interesting to note that MassIR performs best at
γ = 0, i.e., negative feedbacks have a negative impact, albeit
a small one. In contrast, negative feedbacks have a significant
positive impact on the retrieval performance of Lp-IR.
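The remark above that the Lp-norm violates the triangle inequality when p < 1 can be checked numerically. The following is a minimal sketch, not part of the original experiments; the function names and the three test points are hypothetical and chosen only for illustration.

```python
def lp_dissimilarity(x, y, p):
    """Return (sum_j |x_j - y_j|^p)^(1/p) for two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def violates_triangle(x, y, z, p):
    """True if d(x, z) > d(x, y) + d(y, z) under the Lp dissimilarity."""
    direct = lp_dissimilarity(x, z, p)
    detour = lp_dissimilarity(x, y, p) + lp_dissimilarity(y, z, p)
    return direct > detour


# Three illustrative points: y lies "between" x and z along the axes.
x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

for p in (0.5, 1.0, 2.0):
    print(p, violates_triangle(x, y, z, p))
```

With these points, the direct dissimilarity for p = 0.5 is (1 + 1)^2 = 4, which exceeds the detour of 1 + 1 = 2, so the axiom fails; for p = 1 and p = 2 it holds.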

Note that Lp-IR is a distance-based approach of our own
creation. With the right setting, it performs better than Qsim,
InstRank, MRBIR and BALAS (see the results shown in Figures 3
and 5), and it has the same level of retrieval performance as
ReFeat.

The execution time comparison is given in Table VI.
Compared head-to-head using ψ = 8, ReFeat runs 5.6
times faster than MassIR, i.e., they are in the same order of


[Figure 3 panels: (a) GTZAN music database; (b) COREL image database. Horizontal axis: round of relevance feedback.]


Figure  3:  Comparison  of  retrieval  performance  in  terms  of  MAP 


Figure 4: Relative retrieval performance of MassIR and ReFeat as ψ increases (feedback round 5 performance).

Figure 5: Relative retrieval performance of MassIR and Lp-IR for different values of γ (feedback round 5 performance).
The results of Lp-IR (p = 0.25) were worse than the best results from other p values and were omitted to improve
readability.


Table VI: Online execution time per query/feedback in milliseconds
for COREL for MassIR with ψ = 8 (MIR8), ReFeat (RF), Lp-IR (LPIR),
Qsim (QS), InstRank (IR), MRBIR (MR) and BALAS (BA)

Round    MIR8    RF      LPIR    QS       IR      MR        BA
query    194     35.8    5.5     24.7     24.7    612.9     N.A.
1        194     35.1    16.2    71.3     32.6    1172.4    262.8
2        195     35.4    26.8    146.3    33.4    1172.3    317.5
3        195     35.0    37.4    261.9    34.2    1172.3    373.0
4        196     34.8    48.0    417.9    34.9    1172.2    473.9
5        196     34.8    58.5    615.8    35.5    1172.1    506.0

time complexity O(nt). This is consistent with the analysis
of time complexity provided in Section VII-A.

It is important to emphasise that MassIR behaves differently
from ReFeat in two ways. First, the retrieval performance of
MassIR improves as ψ increases, but this is not the case for
ReFeat. Second, negative feedbacks have an impact on ReFeat
but not on MassIR.

VIII.  Discussion 

Distance metric learning [24] aims to learn a metric that
obeys the four distance axioms and achieves a better
performance outcome than a standard metric such as Euclidean
distance. There are two approaches. First, the supervised
approach requires information about pairs of instances
belonging to the same class as well as different classes
in the training data. This information is used to compute
the feature relevance that changes the neighbourhood of a
test instance to improve the decision outcome. Second, the
unsupervised approach, or manifold learning, aims to learn a
low dimensional manifold where the distance between most
instances is preserved; it is closely related to dimension
reduction. In contrast, albeit an unsupervised approach,
Massim is a nonmetric that neither computes distance nor
requires dimension reduction.

Unlike many other nonmetric similarity measures (e.g., the
Lp-norm where p < 1), the new measure satisfies the
triangle inequality axiom. Thus, it can use existing indexing
schemes such as M-Trees [25] to reduce the online time
complexity of MassIR from O(nt) to O(log(n)t).

Similarities and differences between iForest and sForest
are as follows. Both ensure that non-empty regions are
constructed. iForest is likely to produce unbalanced trees,
each built from a subset of size ψ; and it is a unary function.
Mapping to ℝ^t using iForest to produce t relevance features
is an essential step in ReFeat.

In contrast, sForest produces balanced trees only, each
built from a subset of size ψ, but it requires the entire data
set to determine the mass in each local region; and it is a
binary function. No feature space mapping is required in MassIR.

Also note that the resultant values of ℓ(·) and Mass(·,·)
for any given test instances would increase if iForest were
trained using a larger subsample or sForest were trained
using a larger training set (even with the same subsample
size). They can be easily normalised to be independent of
training size, if this is required.

Our results show that MassIR only requires positive
feedbacks (i.e., γ = 0) to improve its retrieval performance.
This reduces its time complexity from O((n + |Q|ψ)t) to
O((n + |P|ψ)t).

The harmonic mean (e = -1) is less susceptible to large
outliers than the arithmetic mean (e = 1). Therefore, we can
expect MassIR to perform better when e = -1 than when e = 1.
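The sensitivity argument above can be illustrated with a small numerical sketch. It assumes that e acts as the exponent of a generalised (power) mean, which is inferred from the pairing of e = -1 with the harmonic mean and e = 1 with the arithmetic mean; the function name and the sample values are hypothetical.

```python
def power_mean(values, e):
    """Generalised mean (mean of v^e)^(1/e) for positive values and e != 0.

    e = 1 gives the arithmetic mean; e = -1 gives the harmonic mean.
    """
    return (sum(v ** e for v in values) / len(values)) ** (1.0 / e)


clean = [2.0, 2.0, 2.0, 2.0]
with_outlier = [2.0, 2.0, 2.0, 200.0]  # one large outlier replaces a value

for e in (1, -1):
    shift = power_mean(with_outlier, e) - power_mean(clean, e)
    print(f"e = {e:2d}: shift caused by the outlier = {shift:.3f}")
```

In this example the single large value moves the arithmetic mean from 2 to 51.5, whereas the harmonic mean moves by less than 1.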

Treating e = -1 and γ = 0 as defaults, the only parameter
that needs to be set for MassIR is the sample size, ψ (t
should be set to a high value such as 1000, as in the case
of ReFeat). Since both the retrieval performance and the
processing time increase with ψ, the parameter selection of
MassIR is a trade-off between retrieval performance and
runtime.

It is interesting to note that the retrieval performance of
Lp-IR could be improved by using the harmonic mean
rather than the arithmetic mean in the feedback formulations
shown in Table IV (but not for ReFeat). However, even with
this improvement, Lp-IR still performs worse than MassIR.

We have shown that Massim is significantly better than the
Lp-norm (used in InstRank, Qsim, BALAS and Lp-IR) in
two data sets. However, a caveat is in order. One should
not expect Massim to outperform the Lp-norm in all cases.
There are a number of factors which affect the performance,
e.g., whether the algorithms make full use of the similarity
measure, and whether the characteristics of the domain
match the similarity measure.

There are many applications that use tree structures. Before
examining their superficial similarities, one must first check
their purposes. sTree is created for similarity measurement,
whereas other trees, e.g., the k-d tree, are created for indexing
in order to speed up the nearest neighbour search. The
superficial similarity between sTrees and k-d trees is that the
latter is based on a median split and the former on a
randomly selected split close to the median. There are many
differences, e.g., (i) a k-d tree is created by cycling through
all dimensions one at a time to build each subsequent internal
node in the tree, but an sTree randomly selects an attribute to
build an internal node; (ii) a k-d tree is constructed using
the entire data set, but an sTree is built using a small subset.
These similarities and differences are superficial as far as
the purpose is concerned.

IX.  CONCLUDING  REMARKS 

The mass-based similarity measure, Massim, represents a
paradigm shift from measuring similarity in terms of distance
to measuring similarity in terms of mass.

We establish the theoretical foundation of Massim, which
is a generalisation of mass estimation. This generalisation
shows that the new binary function for similarity measure
Mass(·,·) produces better retrieval performance than the
commonly used distance-based similarity measures.


This paper shows the effect of overcoming the key weakness
of ReFeat by ensuring that two similar instances are
in the same local neighbourhood. This contributes directly
to (i) the better retrieval performance of MassIR; and (ii) a
more desirable behaviour of MassIR, where the retrieval
performance of MassIR can be improved by increasing
ψ. Our empirical results verify that MassIR has a better
retrieval performance than ReFeat while having the same
order of time and space complexities.


An ensemble approach to estimate multi-dimensional
likelihood in Bayesian classifier learning

Sunil  Aryal  •  Kai  Ming  Ting 


Abstract  In Bayesian classifier learning, estimating the joint probability
distribution p(x, y) or the likelihood p(x|y) directly from training data is
considered to be difficult, especially in large multi-dimensional data sets. In
order to circumvent this difficulty, existing Bayesian classifiers such as Naive
Bayes, BayesNet and AηDE have focused on estimating simplified surrogates of
p(x, y) from different forms of one-dimensional likelihoods.

Contrary to the perceived difficulty in multi-dimensional likelihood estimation,
we present a simple ensemble approach to estimate multi-dimensional likelihood
directly from data. The idea is to aggregate p_i(x|y) estimated from a random
sub-sample of data 𝒟_i (i = 1, 2, ..., t). This paper presents two ways to
estimate multi-dimensional likelihoods using the proposed ensemble approach and
introduces two new Bayesian classifiers called ENNBayes and MassBayes that
estimate p_i(x|y) using a nearest neighbour density estimation and a probability
estimation through feature space partitioning, respectively.

Unlike the existing Bayesian classifiers, ENNBayes and MassBayes have constant
training time and space complexities and scale better than existing Bayesian
classifiers in very large data sets. Our empirical evaluation shows that ENNBayes
and MassBayes yield better predictive accuracy than the existing Bayesian
classifiers on benchmark data sets.

Keywords  Bayesian  classifiers,  Multi-dimensional  likelihood  estimation 


1  Introduction 

In a classification task, the training data D is a collection of labelled
instances {(x^(i), y^(i))} (i = 1, 2, ..., n), where x is a d-dimensional vector
(x_1, x_2, ..., x_d) and y is a label from the predefined set of c classes.

The Bayesian approach to classifier learning models the joint probability
distribution p(x, y) and predicts the most probable class using Bayes rule as:

    \hat{y} = \arg\max_{y} \, p(\mathbf{x}, y)    (1)

Using the product rule, the joint probability can be factorised as:

    p(\mathbf{x}, y) = p(y) \times p(\mathbf{x} \mid y)    (2)
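Equations 1 and 2 together give a simple decision rule: pick the class that maximises p(y) × p(x|y). The following is a minimal sketch of that rule only; the `predict` function, the class labels and the toy likelihood table are hypothetical and stand in for whatever estimator supplies p(x|y).

```python
def predict(x, priors, likelihood):
    """Return the class y that maximises p(y) * p(x|y).

    priors:     dict mapping class label -> p(y)
    likelihood: callable (x, y) -> estimate of p(x|y)
    """
    return max(priors, key=lambda y: priors[y] * likelihood(x, y))


# Toy, hand-written values used purely for illustration.
priors = {"a": 0.7, "b": 0.3}
table = {
    ("x1", "a"): 0.1, ("x1", "b"): 0.6,
    ("x2", "a"): 0.9, ("x2", "b"): 0.4,
}


def toy_likelihood(x, y):
    return table[(x, y)]


print(predict("x1", priors, toy_likelihood))
print(predict("x2", priors, toy_likelihood))
```

For "x1" the joint scores are 0.07 for class "a" and 0.18 for class "b", so the larger prior of "a" is outweighed by the likelihood.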

Bayesian classifiers learn either the joint probability distribution p(x, y) or
the conditional probability distribution (likelihood) p(x|y). Estimating p(x, y) or
p(x|y) directly from data is considered to be difficult because commonly used
density estimators such as the kernel density estimator (KDE) and the k-nearest
neighbour (kNN) density estimator are impractical in problems with large data
sizes and a moderate number of dimensions. This is due to their high space and
time complexities (Silverman, 1986).

However, surrogates of p(x, y) can be estimated efficiently provided that some
simplifying assumptions are made (e.g., attributes are independent given the class
label, or data are assumed to have some known distribution such as Gaussian).
Existing Bayesian classifiers make some kind of conditional independence assumption
and estimate the simplified surrogate of p(x, y) as the product of one-dimensional
likelihoods. Even though existing Bayesian classifiers such as Naive Bayes
(Langley et al., 1992), BayesNet (Friedman et al., 1997) and AηDE (Webb et al.,
2012) have been shown to perform well in many application domains, they can result
in poor predictive accuracy in many other real-world problems as the estimate of
p(x, y) is a biased one-dimensional approximation of the true distribution.

Instead of investigating different forms of one-dimensional likelihoods, we
re-examine the premise that multi-dimensional likelihoods are difficult to estimate
and find that there is a simple way to estimate multi-dimensional likelihoods
directly from data using an ensemble approach.

In  this  paper,  we  make  the  following  three  contributions: 

1. Presenting the first generic approach to estimate p(x|y) directly by aggregating
estimates from an ensemble of t estimators, where each estimates the
multi-dimensional likelihood p_i(x|y) using a fixed-size random sub-sample 𝒟_i ⊂ D
(i = 1, 2, ..., t). This is a generic approach because p_i(x|y) can be estimated
using different data modelling methods.

2. Introducing two variants of the ensemble approach that estimate p_i(x|y) using
a nearest neighbour density estimation and a probability estimation through
feature space partitioning, producing two new Bayesian classifiers called
ENNBayes and MassBayes, respectively.

3. Verifying that the proposed classifiers produce better classification accuracy
and scale better than the state-of-the-art Bayesian classifiers in large data
sets.

The proposed approach has the following distinguishing characteristics over
the existing Bayesian classifiers:

•  Unlike existing Bayesian classifiers that estimate one-dimensional likelihoods,
it estimates multi-dimensional likelihoods directly from data.

•  As it employs fixed-size sub-samples of data, it has constant training time and
constant space complexities. Thus, it can be easily applied in very large data
sets.

The rest of the paper is structured as follows. Section 2 provides an overview
of existing well-known Bayesian classifiers. Section 3 presents the ensemble
approach to estimate p(x|y) directly from data and describes two new Bayesian
classifiers based on the proposed approach: ENNBayes and MassBayes. The
empirical evaluation results are presented in Section 4. Finally, we provide
discussion and conclusions in the last two sections.


2  Related  work 

This section provides a survey of previous works related to this paper. Subsection
2.1 discusses some widely used methods to estimate likelihood in the real domain (ℝ^d).
Subsection 2.2 provides a survey of Bayesian classifier learning and describes some
well-known Bayesian classifiers.


2.1  Likelihood  estimation 

The  likelihood  p(x|y)  for  continuous-valued  attributes  can  be  estimated  either 
through  density  estimation  or  by  estimating  probability  in  the  local  region  (LR) 
of  x. 

The probability p(x|y) in an LR can be estimated as the ratio of the number of
instances in the LR that belong to class y to the total number of instances in
class y. In existing Bayesian classifiers, continuous-valued attributes are converted
to discrete attributes through discretisation, and the LR of x is then defined by
the smallest discrete intervals that contain x.

Discretisation divides the range of an attribute x into v discrete intervals and
maps each x ∈ ℝ to a corresponding interval, yielding a categorical attribute with v
labels, i.e., x* ∈ {a_1, a_2, ..., a_v}. Different methods of discretisation determine the
intervals in different ways. In unsupervised discretisation (e.g., equal width or equal
frequency discretisation (Catlett, 1991)), the class label is not used while selecting
the cut points. Supervised discretisation methods find the cut points based on some
criterion that takes class information into consideration, such as the entropy of the
intervals, the error in the training data or some statistical measure. Fayyad and Irani
(1995) proposed a supervised discretisation method that selects the cut points
which minimise the class entropy. In classification, the supervised methods often
produce better predictive accuracy than the unsupervised methods (Dougherty
et al., 1995). Note that the number of combinations of values grows exponentially
with d because there are v^d possible combinations. Many of those possible
combinations may not appear in the observed data. If the unseen instance x has a
combination of attribute values that did not appear in the observed data, it yields
zero probability. Hence, multi-dimensional likelihood estimation through
discretisation may not provide a good estimate.
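The zero-probability problem described above can be seen on a toy example. The sketch below uses equal-width discretisation and estimates p(x|y) as the fraction of class-y instances sharing x's combination of intervals; all function names and data values are hypothetical, and attribute values are assumed to lie in [low, high].

```python
from collections import Counter


def equal_width_bin(value, low, high, v):
    """Map a value in [low, high] to an interval index in 0..v-1."""
    if value >= high:
        return v - 1
    return int((value - low) / (high - low) * v)


def discretise(x, low, high, v):
    """Discretise every attribute of x with the same equal-width scheme."""
    return tuple(equal_width_bin(xj, low, high, v) for xj in x)


def cell_likelihood(x, class_instances, low, high, v):
    """Fraction of the class's instances that fall in the same cell as x."""
    counts = Counter(discretise(z, low, high, v) for z in class_instances)
    return counts[discretise(x, low, high, v)] / len(class_instances)


v, d = 4, 2
n_cells = v ** d  # number of possible combinations of intervals
class_instances = [(0.1, 0.1), (0.15, 0.2), (0.8, 0.9), (0.85, 0.95)]

seen = cell_likelihood((0.12, 0.18), class_instances, 0.0, 1.0, v)
unseen = cell_likelihood((0.1, 0.9), class_instances, 0.0, 1.0, v)
print(n_cells, seen, unseen)
```

Here only 2 of the 16 cells are occupied, so a test instance whose attribute values are each individually common, but whose combination was never observed, receives an estimate of zero.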

Density estimators approximate the unknown probability density function f
from the observed data. The parametric estimators assume that the data are drawn
from a known distribution and derive the function f by estimating parameters of
the distribution, such as the mean and variance of a Gaussian distribution
(Silverman, 1986). But, in reality, the underlying distribution does not always
belong to a parametric distribution. Non-parametric approaches make less rigid
assumptions and estimate the distribution from the observed data (Silverman, 1986).
The kernel density estimator (KDE) and the k-nearest neighbour density estimator
(kNN) are two well-known non-parametric density estimators.

KDE estimates the density as the average of a kernel function K(·) centered
on each observed instance as (Silverman, 1986):

    f(\mathbf{x}) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left( \frac{\mathbf{x} - \mathbf{x}^{(i)}}{h} \right)    (3)

where h is the smoothing parameter called bandwidth.
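A direct implementation of the kernel density estimate makes its cost visible: every query touches all n observed instances. The sketch below assumes a standard Gaussian kernel, which the text does not specify, and uses hypothetical function names.

```python
import math


def gaussian_kernel(u):
    """Standard d-dimensional Gaussian kernel evaluated at the vector u."""
    d = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (d / 2.0)


def kde(x, data, h):
    """f(x) = 1 / (n h^d) * sum_i K((x - x_i) / h), as in Equation 3."""
    n, d = len(data), len(x)
    total = 0.0
    for xi in data:  # one kernel evaluation per observed instance
        total += gaussian_kernel([(a - b) / h for a, b in zip(x, xi)])
    return total / (n * h ** d)


data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.0)]
print(kde((0.0, 0.0), data, h=0.5))   # near the cluster of three points
print(kde((2.5, 2.5), data, h=0.5))   # in the empty region between groups
```

The bandwidth value 0.5 is arbitrary; in practice h has to be searched for, which is one of the costs discussed in the text.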

In order to estimate the density at point x, kNN searches for the k nearest
neighbours of x in the observed data. A kNN density estimation can be expressed as
follows (Breunig et al., 2000; Tan et al., 2006):

    f(\mathbf{x}) = \frac{|N(\mathbf{x}, k)|}{n \sum_{\mathbf{x}' \in N(\mathbf{x}, k)} \mathrm{distance}(\mathbf{x}, \mathbf{x}')}    (4)

where N(x, k) is the set of k nearest neighbours of x and |s| is the cardinality of
set s.
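The kNN density estimate can be sketched with a brute-force neighbour search. The sketch assumes Euclidean distance (the text leaves the distance function open) and a hypothetical function name; it does not guard against the degenerate case where all k neighbours coincide with x.

```python
import math


def knn_density(x, data, k):
    """f(x) = |N(x,k)| / (n * sum of distances from x to its k nearest neighbours)."""
    # Brute force: compute the distance to all n instances, then keep the k smallest.
    distances = sorted(math.dist(x, xi) for xi in data)
    neighbours = distances[:k]
    return len(neighbours) / (len(data) * sum(neighbours))


data = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(knn_density((0.5,), data, k=2))  # dense region
print(knn_density((9.0,), data, k=2))  # sparse region
```

At x = 0.5 the two nearest neighbours are both 0.5 away, giving 2 / (4 × 1.0); at x = 9.0 they are 1 and 7 away, giving 2 / (4 × 8).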

Both KDE and kNN employ expensive distance calculations over all n
observed instances, which restricts them to small data sets. Apart from high time
and space complexity, KDE and kNN require a search to find optimal values for h
and k. In the Bayesian classification framework, the density distribution of each
class has to be modelled separately. The same value for h or k may not be equally
good for modelling the density distribution of all the classes. Searching for an
optimal value for each class will add further runtime cost to the already expensive
distance calculations. Hence, they are impractical in large multi-dimensional data
sets.

DEMass (Ting et al., 2013) is the first efficient density estimator for large data
sets. It achieves a significant improvement over KDE and kNN in terms of time and
space complexities by aggregating density estimates from t different random
sub-samples of data 𝒟_i (i = 1, 2, ..., t). It is an ensemble approach where the
density in each sub-sample is estimated through multi-dimensional mass estimation
(Ting and Wells, 2010) as follows:

    f_m(\mathbf{x}) = \frac{1}{t} \sum_{i=1}^{t} \frac{|T_i(\mathbf{x})|}{|\mathcal{D}_i| \, v_i}    (5)

where T_i(·) is a function which subdivides the feature space defined by 𝒟_i into
non-overlapping regions through axis-parallel divisions; |T_i(x)| is the number of
instances in the smallest region of T_i(·) into which x falls; and v_i is the volume
of region T_i(x).

DEMass  has  sub-linear  time  complexity  and  constant  space  complexity  (Ting 
et  al.,  2013).  But,  it  is  limited  to  low  dimensional  problems.  Also,  it  is  not  adaptive 
to  data  distribution  as  it  constructs  equal  volume  regions  in  the  data  space. 




2.2  Bayesian  classifier  learning 

Naive Bayes (NB) (Duda and Hart, 1973; Langley et al., 1992) is the simplest
Bayesian approach to solve classification problems. It assumes that the attributes
(x_1, x_2, ..., x_d) are statistically independent given the class label y (class
conditional independence) and estimates p(x|y) as:

    p(\mathbf{x} \mid y) \approx \prod_{j=1}^{d} p(x_j \mid y)    (6)
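The product form of Equation 6 can be sketched as follows. The sketch assumes Gaussian one-dimensional likelihoods, which the text names as the traditional choice for continuous attributes; the function names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(value, mean, std):
    """Density of a one-dimensional Gaussian at the given value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def naive_bayes_likelihood(x, params):
    """Product over attributes j of p(x_j | y), as in Equation 6.

    params holds one (mean, std) pair per attribute for a single class y.
    """
    likelihood = 1.0
    for xj, (mean, std) in zip(x, params):
        likelihood *= gaussian_pdf(xj, mean, std)
    return likelihood


# Hypothetical per-class parameters for a two-attribute problem.
class_a = [(0.0, 1.0), (0.0, 1.0)]
class_b = [(3.0, 1.0), (3.0, 1.0)]
x = (0.2, -0.1)
print(naive_bayes_likelihood(x, class_a), naive_bayes_likelihood(x, class_b))
```

Each factor looks at one attribute in isolation, which is why any dependence between attributes within a class is invisible to this estimate.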

Kononenko (1991) extended NB to detect dependencies between attributes
and proposed a variant of NB called semi-naive Bayes. Langley and Sage (1994)
proposed another variant of NB, called selective naive Bayes, by removing the
correlated attributes in the modelling process. In the traditional NB, the
distribution of a continuous-valued attribute is assumed to be a Gaussian
distribution and p(x_j|y) is estimated through normal density estimation. Langley
and John (1995) have shown the improvement in predictive accuracy of NB by replacing
the parametric normal density estimator with a non-parametric kernel density
estimator (KDE). A similar conclusion was drawn by Dougherty et al. (1995) through
discretisation.

Despite  the  strong  assumption  of  class  conditional  independence,  NB  produces 
impressive  results  in  many  application  domains  where  the  assumption  does  not 
always  hold  (Clark  and  Niblett,  1989;  Kononenko,  1993;  Domingos  and  Pazzani, 
1997).  Its  simplicity  and  clear  probabilistic  semantics  have  motivated  researchers 
to  explore  further  to  lessen  its  assumption  of  conditional  independence. 

Naive  Bayes  tree  (NBTree)  (Kohavi,  1996)  combines  the  benefits  of  decision 
tree  and  NB  by  using  NB  at  the  leaves  of  the  decision  tree  to  make  predictions. 
It  has  been  shown  that  NBTree  outperforms  NB  (Kohavi,  1996).  Lazy  Bayesian 
rules  (LBR)  (Zheng  and  Webb,  2000)  builds  a  rule  which  best  characterises  each 
test  instance  during  testing  and  builds  local  NB  to  classify.  Another  lazy  Bayesian 
classifier  is  locally  weighted  naive  Bayes  (LWNB)  (Frank  et  al.,  2003)  that  searches 
for  k  nearest  neighbours  of  each  test  instance  and  builds  a  NB  based  on  k  nearest 
neighbours  weighted  by  their  distance  to  the  test  instance.  All  these  classifiers 
improve  the  prediction  accuracy  of  NB,  but  they  are  expensive  in  terms  of  runtime 
and  inapplicable  in  large  data  sets. 

A Bayesian network (BayesNet) (Friedman et al., 1997) learns probabilistic
relationships among the attributes and class from the training data, in the form
of a directed acyclic graph (DAG). In the graph, each node is probabilistically
independent of its non-descendants given the state of its parents. At each node,
conditional probabilities with respect to its parents are learned from the training
data. The joint probability p(x, y) is estimated as:

    p(\mathbf{x}, y) = p(x_1 \mid \pi_1) \times p(x_2 \mid \pi_2) \times \cdots \times p(x_d \mid \pi_d) \times p(y \mid \pi_y)    (7)

where π_j is parent(x_j) and π_y is parent(y).

NB is the simplest form of Bayesian network, where each attribute has y as
its only parent.

Learning an optimal network of probabilistic dependencies is difficult and
computationally intractable as it requires searching over every possible network
(Chickering, 1996; Jiang et al., 2007). Thus, in practice, either some heuristics
are used or some restrictions are imposed on the network structure.

Friedman  et  al.  (1997)  proposed  a  Bayesian  network  called  tree  augmented 
naive  Bayes  (TAN),  where  the  probabilistic  relationship  is  a  tree-like  structure. 
TAN  allows  each  attribute  to  have  one  additional  parent  along  with  the  class 
attribute.  Another  variant  of  TAN  learning  is  super  parent-TAN  (SP-TAN)  (Keogh 
and  Pazzani,  1999)  that  searches  for  a  common  super  parent  which  yields  the 
minimum  error  in  the  training  data.  Grossman  and  Domingos  (2004)  proposed  a 
heuristic  to  produce  a  Bayesian  network  classifier  that  learns  a  network  structure 
(probabilistic  relationship)  by  maximising  conditional  likelihood. 

Averaged η-dependence estimators (AηDE) (Webb et al., 2012) avoids the
expensive search in learning probabilistic dependencies by constructing an
ensemble of η-dependence estimators. It is the only ensemble approach that was
designed specifically to estimate p(x, y). It began with one-dependence estimators
(Webb et al., 2005) and was then generalised to η-dependence estimators (Webb et al.,
2012). Each estimator allows dependency between y and η privileged attributes, or
super-parents. The other attributes are assumed to be conditionally independent
given the η super-parents and y. The joint probability p(x, y) is estimated as:

    p(\mathbf{x}, y) = \sum_{s \in S_\eta} p(\mathbf{x}_s, y) \prod_{j \in \{1,2,\ldots,d\} \setminus s} p(x_j \mid \mathbf{x}_s, y)    (8)

where S_η is the collection of all subsets of size η of the set of d attributes
{1, 2, ..., d}; and x_s is an η-dimensional vector of values of x defined by s.

NB is a member of the AηDE family with η = 0, i.e., A0DE. A1DE and A2DE
produce better predictive accuracy than the other state-of-the-art Bayesian
classifiers (Webb et al., 2005, 2012). However, AηDE only allows dependencies on a
fixed number of attributes and y. Because of the high time complexity of
O(n $\binom{d}{\eta+1}$) and space complexity of O(c $\binom{d}{\eta+1}$ v^(η+1)),
where v is the average number of values for an attribute (Webb et al., 2012), only
A2DE or A3DE is feasible even for a moderate number of dimensions. Furthermore,
selecting an appropriate value of η for a particular data set requires a search.
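To give a feel for why only low orders of dependence are practical, the following sketch evaluates the quantity c × C(d, η+1) × v^(η+1) for a few values of η. It assumes this is the form of the space bound reported for this family by Webb et al. (2012); the function names and the values of d, c and v are hypothetical.

```python
from math import comb


def ensemble_size(d, eta):
    """Number of size-eta subsets of d attributes, i.e. the binomial coefficient."""
    return comb(d, eta)


def space_terms(d, eta, c, v):
    """c * C(d, eta + 1) * v^(eta + 1): the assumed quantity inside the space bound."""
    return c * comb(d, eta + 1) * v ** (eta + 1)


d, c, v = 20, 2, 5  # illustrative problem: 20 attributes, 2 classes, 5 values each
for eta in (1, 2, 3):
    print(eta, ensemble_size(d, eta), space_terms(d, eta, c, v))
```

Under these illustrative settings the count grows from 9,500 at η = 1 to over six million at η = 3.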

All of the Bayesian classifiers surveyed so far estimate the joint distribution
as the product of one-dimensional likelihoods. Both BayesNet and AηDE allow
limited probabilistic dependencies among attributes. None of them consider the
unrestricted inter-dependencies between the attributes, and they all have
limitations in terms of time and space complexities in large data sets.

DEMassBayes (Ting et al., 2013) estimates the multi-dimensional likelihood
directly, without making any explicit assumptions, through DEMass. It builds t
models per class y from subsets of instances belonging to class y in each 𝒟_i, i.e.,
𝒟_{i,y} ⊂ 𝒟_i, and estimates p(x|y) as:

    p(\mathbf{x} \mid y) = \frac{1}{t} \sum_{i=1}^{t} \frac{|T_{i,y}(\mathbf{x})|}{|\mathcal{D}_{i,y}| \, v_{i,y}}    (9)

where v_{i,y} is the volume of the region T_{i,y}(x).

Footnote 1: $\binom{d}{\eta}$ is the binomial coefficient of η out of d.

DEMassBayes had better predictive accuracy than A1DE and BayesNet in low
dimensional problems. But, it had poor predictive accuracy in problems with a
moderate number of dimensions. This is because the volume v_{i,y} reduces
exponentially and becomes very small as the parameter controlling the number of
regions increases; and the estimation of p(x|y) is adversely affected by regions
having small volumes with few instances.


3  An  ensemble  approach  for  multi-dimensional  likelihood  estimation 


Even though DEMassBayes has some limitations, it sets a benchmark for employing an
ensemble approach to estimate the multi-dimensional likelihood from sub-samples
of data. The approach used in DEMassBayes can be generalised as a generic
ensemble approach to estimate p(x|y) as follows:

    p(\mathbf{x} \mid y) = \frac{1}{t} \sum_{i=1}^{t} p_i(\mathbf{x} \mid y)    (10)

where p_i(x|y) is estimated from a fixed-size sub-sample 𝒟_i ⊂ D (i = 1, 2, ..., t),
|𝒟_i| = ψ < n.

From a small sub-sample 𝒟_i, p_i(x|y) can be easily estimated by using existing
data modelling techniques such as density estimation or other means.
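The generic ensemble estimate of Equation 10 can be sketched as an average over t sub-samples with a pluggable per-sub-sample estimator. The sketch is illustrative only: `ensemble_likelihood` and `radius_estimator` are hypothetical names, and the radius-based estimator is a simple placeholder, not the ENNBayes or MassBayes estimator.

```python
import random


def ensemble_likelihood(x, y, data, estimator, t, psi, seed=0):
    """Average p_i(x|y) over t random sub-samples of size psi (Equation 10).

    data:      list of (instance, label) pairs
    estimator: callable (x, y, subsample) -> estimate of p_i(x|y)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(t):
        subsample = rng.sample(data, psi)  # D_i, drawn without replacement
        total += estimator(x, y, subsample)
    return total / t


def radius_estimator(x, y, subsample, radius=1.0):
    """Placeholder: fraction of class-y instances in the sub-sample within radius of x."""
    same_class = [z for z, label in subsample if label == y]
    if not same_class:
        return 0.0
    return sum(abs(z - x) <= radius for z in same_class) / len(same_class)


# One-dimensional toy data: class "a" spans [0, 4.9], class "b" spans [10, 14.9].
data = [(i * 0.1, "a") for i in range(50)] + [(10 + i * 0.1, "b") for i in range(50)]

p_a = ensemble_likelihood(2.0, "a", data, radius_estimator, t=50, psi=20)
p_b = ensemble_likelihood(2.0, "b", data, radius_estimator, t=50, psi=20)
print(p_a, p_b)
```

Each estimator sees only ψ = 20 instances, so the cost of building it does not depend on the size of the full data set; only the number of estimators t and ψ matter.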

DEMassBayes is one realisation of Equation 10 that estimates p_i(x|y) through
mass-based density estimation. It estimates the density f_i(x|y) from each 𝒟_i as:

    f_i(\mathbf{x} \mid y) = \frac{|T_{i,y}(\mathbf{x})|}{|\mathcal{D}_{i,y}| \, v_{i,y}}    (11)


From f_i(x|y), p_i(x|y) can be estimated as the probability that x lies in a small
region e:

    p_i(\mathbf{x} \mid y) = \int_{e} f_i(\mathbf{x} \mid y) \, d\mathbf{x} \approx f_i(\mathbf{x} \mid y) \times \mathrm{volume}(e)    (12)

Since volume(e) is a constant, it can be ignored in the classification decision, and
p_i(x|y) can be expressed as:

    p_i(\mathbf{x} \mid y) \propto f_i(\mathbf{x} \mid y)    (13)
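The step from Equation 12 to Equation 13 relies on volume(e) being the same for every class, so that it does not affect which class attains the maximum. A short derivation, written under that assumption and combining Equations 1, 2, 10 and 12, is:

```latex
\begin{align*}
\hat{y}
  &= \arg\max_{y} \; p(y) \, \frac{1}{t} \sum_{i=1}^{t} p_i(\mathbf{x} \mid y) \\
  &\approx \arg\max_{y} \; p(y) \, \frac{1}{t} \sum_{i=1}^{t}
     f_i(\mathbf{x} \mid y) \times \mathrm{volume}(e) \\
  &= \arg\max_{y} \; \mathrm{volume}(e) \left[ p(y) \, \frac{1}{t}
     \sum_{i=1}^{t} f_i(\mathbf{x} \mid y) \right] \\
  &= \arg\max_{y} \; p(y) \, \frac{1}{t} \sum_{i=1}^{t} f_i(\mathbf{x} \mid y)
\end{align*}
```

The last step holds because multiplying every class score by the same positive constant leaves the arg max unchanged.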

The ensemble approach estimates the multi-dimensional likelihood directly
from data. It has the following characteristics:

•  The average over t estimators provides a good approximation of p(x|y) as it
considers the distribution over different local neighbourhoods of x.

•  As it employs fixed-size data sub-samples, it has constant training time and
constant space complexities.

•  Existing density estimators, which have problems running in large data sets,
can now be used as long as ψ ≪ n.

•  In large data sets where tψ < n, the proposed method runs faster than
estimating p(x|y) using the entire data set D.

•  The time complexity can be further reduced by a factor of t by using parallel
computing. Each estimator in the ensemble can be built independently on a
different computing node.

The proposed ensemble method differs from the existing ensemble methods in
classifier learning in two ways. First, ensemble methods such as Bagging (Breiman,
1996), Boosting (Freund and Schapire, 1996), Random Forest (Breiman, 2001) and
Feating (Ting et al., 2011) aggregate the posterior probability p(y|x) from different
models to make the final prediction, whereas the proposed method aggregates
p(x|y) and uses the Bayes rule to make the final prediction. Second, each model in
the existing ensemble methods is trained with a sample of size n, but each model
in the proposed method can be built with a small sub-sample of training data
(ψ ≪ n) and the ensemble still performs well.

Like the existing ensemble approach of AηDE, the proposed new ensemble
approach avoids search in learning a Bayesian classifier. But, it has the following
distinguishing characteristics in comparison with AηDE:

•  AηDE estimates one-dimensional likelihoods given a fixed number of
super-parents and y, whereas the proposed approach estimates multi-dimensional
likelihoods directly.

•  In AηDE, the ensemble size is fixed to $\binom{d}{\eta}$. But, the proposed
approach has the flexibility for users to set the ensemble size.

•  Each model in the proposed approach is built with a training sub-sample of size
ψ < n, which gives rise to the constant training time. In contrast, each model
in AηDE is trained using the entire training set.

•  AηDE is a deterministic algorithm whereas the proposed approach is a
randomised algorithm.

Using the ensemble approach to estimate p(x|y), we introduce two new Bayesian
classifiers using two new realisations of Equation 10: ENNBayes and MassBayes.
ENNBayes uses a nearest neighbour density estimator to estimate p_i(x|y). MassBayes
partitions the data space of 𝒟_i to create local regions and then estimates
p_i(x|y) from the numbers of instances in the local region (LR) and 𝒟_i. ENNBayes is
a lazy approach that defines LRs implicitly in terms of nearest neighbours when the
test instance is presented. MassBayes is an eager learner that explicitly partitions
the feature space enclosed by 𝒟_i. The conceptual differences between ENNBayes,
MassBayes and DEMassBayes are provided in Table 1. ENNBayes and MassBayes
are described in the next two subsections.


Table 1: Three realisations of the proposed ensemble approach to estimate p(x|y):
ENNBayes, MassBayes and DEMassBayes

Classifiers    Defining Local Regions                      Estimating p_i(x|y)                         Model
-----------    ----------------------------------------    ----------------------------------------    --------------------------
ENNBayes       Implicit with nearest neighbours.           Nearest neighbour density estimation.       No model
MassBayes      Explicit with feature space partitioning.   Probability estimation in local regions.    One model for all classes.
DEMassBayes    Explicit with feature space partitioning.   Density estimation based on mass.           One model per class.

3.1  ENNBayes: An ensemble of nearest neighbour density estimators


We call the new Bayesian classifier that estimates the likelihood using an ensemble
of nearest neighbour density estimators ENNBayes (Ensemble of Nearest
Neighbour Bayesian classifier).

Using a random sub-sample of training instances 𝒟_i, the conditional density
f_i(x|y) can be estimated as:

$$f_i(x|y) = \frac{|N(x, k | \mathcal{D}_{i,y})|}{|\mathcal{D}_{i,y}| \sum_{x' \in N(x, k | \mathcal{D}_{i,y})} \mathrm{distance}(x, x')} \qquad (14)$$

where 𝒟_{i,y} is the set of instances belonging to class y in 𝒟_i, ⋃_y 𝒟_{i,y} = 𝒟_i; and
N(x, k|𝒟_{i,y}) is the set of k nearest neighbours of x in 𝒟_{i,y}.

Since the nearest neighbour instance contributes significantly to f_i(x|y), the
above equation can be simplified by considering the nearest neighbour only (i.e.,
k = 1).

In the case of a skewed class distribution, instances from some classes may not
be present in 𝒟_i. If there is no instance from a class in 𝒟_i, p_i(x|y) for that class is
zero. Hence, p_i(x|y) from 𝒟_i is estimated as:

$$p_i(x|y) = \begin{cases} \dfrac{1}{|\mathcal{D}_{i,y}| \times \mathrm{distance}(x, NN(x|\mathcal{D}_{i,y}))} & \text{if } |\mathcal{D}_{i,y}| > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

where NN(x|𝒟_{i,y}) is the nearest neighbour of x in 𝒟_{i,y}.

With k = 1, the final estimate considers different local neighbourhoods of
the test instance (the nearest neighbour in each 𝒟_i). An illustrative example of
implicit local regions centred on the nearest neighbour of x in different sub-samples
is shown in Figure 1. The shape of the regions depends on the distance
measure (Lp-norm) used: a hypersphere if p = 2 (Euclidean distance) and a hypercube
if p = ∞.

ENNBayes is a lazy learning approach. There is no model building process. In
the training phase, each data sub-sample 𝒟_i ⊂ D (i = 1, 2, ..., t) is constructed
by sampling ψ instances from D without replacement. The sampling process is
restarted with D when all the instances in D are used. In the testing phase, the
nearest neighbour of the test instance x in each class is searched in each sub-sample
𝒟_i and p_i(x|y) is estimated from Equation 15.

Fig. 1: Implicit regions to estimate p_i(x) using Euclidean distance (L2-norm).

ENNBayes needs to store tψ instances. Hence, the space complexity is O(tψd).
In order to classify a test instance, the nearest neighbour in each class is searched
in each of the t sub-samples. If φ is the average number of instances per class in
each 𝒟_i, the total testing time complexity is O(ctφd). Even though ENNBayes
reduces the time complexity to constant in terms of n, it still requires distance
calculations in order to find the nearest neighbour.
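The sampling and testing steps described above can be sketched as follows. This is a minimal Python illustration, not the authors' Java/WEKA implementation: the function names, the representation of the data as a list of (x, y) pairs, and the small floor applied to a zero distance are all assumptions made for the example.

```python
import math
import random


def draw_subsamples(data, t, psi, seed=0):
    """Draw t sub-samples of size psi without replacement.

    The pool is refilled from the full data set whenever it runs out,
    mirroring the restart rule in the text. A sub-sample that straddles a
    refill may therefore contain the same instance twice.
    """
    rng = random.Random(seed)
    pool, subsamples = [], []
    for _ in range(t):
        sample = []
        while len(sample) < psi:
            if not pool:
                pool = list(data)
                rng.shuffle(pool)
            sample.append(pool.pop())
        subsamples.append(sample)
    return subsamples


def enn_likelihood(subsample, x, y):
    """Equation 15 with k = 1, using Euclidean distance."""
    same_class = [xi for xi, yi in subsample if yi == y]
    if not same_class:
        return 0.0  # class y is absent from this sub-sample
    nearest = min(math.dist(x, xi) for xi in same_class)
    # Assumption: floor the distance to avoid division by zero when x
    # coincides with a stored instance; the text does not cover this case.
    return 1.0 / (len(same_class) * max(nearest, 1e-12))


def enn_bayes_predict(subsamples, priors, x):
    """Average p_i(x|y) over the sub-samples and apply the Bayes rule."""
    scores = {}
    for y, prior in priors.items():
        avg = sum(enn_likelihood(s, x, y) for s in subsamples) / len(subsamples)
        scores[y] = prior * avg
    return max(scores, key=scores.get)
```

Every prediction scans all stored instances of every class in every sub-sample, which is where the O(ctφd) testing cost comes from.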


3.2  MassBayes:  Multi-dimensional  likelihood  estimation  through  probability 
estimation  in  a  local  region  defined  by  feature  space  partitioning 

MassBayes estimates the multi-dimensional likelihood p_i(x|y) through probability
estimation in a local region. Instead of converting continuous-valued attributes to
discrete attributes through discretisation (as done by existing Bayesian classifiers
as a preprocessing step), MassBayes defines local regions through feature space
partitioning in the model building process.

Let T_i(·) be a function that divides the data space of 𝒟_i into non-overlapping
regions and T_i(x) be the smallest local region into which x falls. The likelihood
p_i(x|y) is defined as follows:

$$p_i(x|y) = \frac{|T_i(x, y)|}{|\mathcal{D}_{i,y}|} \qquad (16)$$

where |T_i(x, y)| is the number of instances belonging to class y in T_i(x) and |𝒟_{i,y}|
is the number of instances belonging to class y in 𝒟_i.

The feature space partitioning is constructed using a binary tree structure
called h:d-tree, as used by Ting and Wells (2010) in multi-dimensional mass estimation.
The data space enclosing instances in 𝒟_i is sub-divided into two half-spaces
by splitting at the mid-point on a dimension. The process is then repeated, on a
different dimension, in each non-empty half-space with more than one instance in it.

In a multi-dimensional space, each instance in 𝒟_i can be isolated by splitting
on a few dimensions, i.e., only a subset of the d attributes, g ⊂ {1, 2, ..., d}, is used to
define T_i(x). Hence, p_i(x|y) is estimated as p_i(x_g|y), where x_g is a |g|-dimensional
vector of values of x defined by g, and 1 ≤ |g| ≤ d.

$$p_i(x|y) = p_i(x_g|y) = \frac{|T_i(x, y)|}{|\mathcal{D}_{i,y}|} \qquad (17)$$

The regions T_i(x) from different 𝒟_i are represented by different feature subsets.
Hence, the multi-dimensional likelihood p(x|y) can be expressed as:

$$p(x|y) = \frac{1}{t} \sum_{g \in G_t} p(x_g|y) \qquad (18)$$

where G_t is a collection of t subsets of varying sizes of d attributes used to define
T_i(x) in each 𝒟_i.

Fig. 2: Different regions from different T_i(·) (i = 1, 2, ..., 5) that cover x.

The average probability of t different regions T_i(x) (i = 1, 2, ..., t), constructed
using different 𝒟_i, provides a good estimate of p(x|y) as it considers the distribution
in different local neighbourhoods of x in the data space. An illustrative
example of different regions covering x is provided in Figure 2.
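As a purely hypothetical illustration of this averaging (the counts below are invented for the example and are not taken from the experiments), suppose t = 2, the region covering x in the first sub-sample holds 3 of its 10 instances of class y, and the region covering x in the second sub-sample holds 1 of its 8 instances of class y:

$$p_1(x|y) = \frac{3}{10} = 0.3, \qquad p_2(x|y) = \frac{1}{8} = 0.125, \qquad p(x|y) = \frac{1}{2}\,(0.3 + 0.125) = 0.2125$$

Each term uses a different region, and possibly a different feature subset g, around the same test instance x.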

As our implementation is motivated by the multi-dimensional mass estimation
of Ting and Wells (2010), we call the resulting Bayesian classifier MassBayes.
MassBayes estimates p(x|y) by aggregating the multi-dimensional likelihoods estimated
from random subsets of the training data using varying-size random feature
subsets.


3.2.1  Implementation 

We use the same algorithm as used by Ting and Wells (2010) to generate h:d-trees
to represent T_i(·), with the following modification: the tree building process stops
early once every instance is isolated. In the original implementation, each tree is
built to the maximum height of h × d (where h is a parameter that defines the
maximum level of sub-divisions), resulting in equal-size regions. Even in a moderate
number of dimensions, many of the regions remain empty, and each non-empty
region encloses a very small area around the observed data instances, resulting in
a poor estimate for unseen instances. The procedures to generate t trees from a
given data set D are provided in Algorithms 1 and 2 in Appendix A.

Let the data space that envelops the instances in 𝒟 be A. The data space A
is adjusted to become δ using a random perturbation conducted as follows. For
each dimension j, a split point v_j is chosen randomly within the range
[min_j(A), max_j(A)]. Then, the new range δ_j along dimension j is defined as [v_j − r, v_j + r],
where r = max(v_j − min_j(A), max_j(A) − v_j). The new range on all dimensions
defines the adjusted work space for the tree building process. The random adjustment
of the work space ensures that no two trees are identical even if they are
constructed from the same subset of instances, and that the work space covers a wider area than A.
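The perturbation can be sketched as below. This is an illustrative Python fragment only: the function name `adjust_work_space` is hypothetical, and drawing v_j uniformly is an assumption, since the text says only that the point is chosen randomly.

```python
import random


def adjust_work_space(mins, maxs, rng=None):
    """Randomly perturb the bounding box A (given by mins/maxs per dimension).

    For each dimension j, draw v_j in [min_j(A), max_j(A)] and return the
    range [v_j - r, v_j + r] with r = max(v_j - min_j(A), max_j(A) - v_j).
    """
    if rng is None:
        rng = random.Random()
    new_mins, new_maxs = [], []
    for lo, hi in zip(mins, maxs):
        v = rng.uniform(lo, hi)   # assumption: uniform draw
        r = max(v - lo, hi - v)   # distance from v to the farther edge of A
        new_mins.append(v - r)
        new_maxs.append(v + r)
    return new_mins, new_maxs
```

Because r reaches the farther edge of A, the new range always contains the original one, and its mid-point v_j differs from tree to tree.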

The dimension to split is selected from a randomised set of d dimensions in
a round-robin manner at each level of a tree. The maximum allowed height of a
tree is h × d. At the leaf node, the number of instances belonging to each class is
stored.

Fig. 3: An example of an h:d-tree for h = 2 and d = 2.

A typical example of an implementation of T(·) as an h:d-tree for h = 2 and
d = 2 is shown in Figure 3. The dotted box envelops the instances in 𝒟 and the
outer solid box represents the adjusted work space, which has ranges δ_1 and δ_2 on
the x_1 and x_2 dimensions. R1, R2, R3, R4 and R5 represent different regions in T(·)
depending on the data distribution in 𝒟. R1 is defined by splitting the work space
in the x_1 dimension only (i.e., g = {1}), whereas the other four regions use dimensions
x_1 and x_2 (i.e., g = {1, 2}).

In the worst case, an h:d-tree has height hd. At each level of a tree, a total of ψ
instances are assigned to all nodes. Hence, the time complexity to construct a single
tree is O(hdψ). Since t such trees are constructed, the training time complexity
is O(thdψ). When classifying a test instance, each of the t trees is traversed from
the root to the leaf node into which the test instance falls. The complexity of that
search is O(thd). Each leaf node represents a region having a value range (with
the minimum and the maximum value) in each of the d dimensions and the number
of instances belonging to each of the c classes. There is a total of min(2^{hd}, ψ) leaf
nodes in each of the t trees. In general, ψ < 2^{hd}; thus, the total space complexity
is O(t(c + d)ψ). During the tree building process, an additional space of nd is
required to store the n training instances, which can be discarded once the trees are
constructed.
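The tree construction with early stopping, and the lookup used for Equation 16, can be sketched as follows. This is a simplified Python illustration, not Algorithms 1 and 2 of Appendix A: the dictionary-based node layout and the function names are hypothetical, and returning zero for a class absent from the sub-sample is an assumption made by analogy with Equation 15.

```python
from collections import Counter


def build_tree(instances, mins, maxs, dims, depth, max_height):
    """Halve the work space until every instance is isolated.

    instances: list of (x, y) pairs; mins/maxs: bounds of the current region;
    dims: randomised order of dimensions, used round-robin by tree level.
    """
    if len(instances) <= 1 or depth == max_height:
        # Leaf: store the number of instances of each class in the region.
        return {"counts": Counter(y for _, y in instances)}
    j = dims[depth % len(dims)]
    mid = (mins[j] + maxs[j]) / 2.0
    left = [(x, y) for x, y in instances if x[j] < mid]
    right = [(x, y) for x, y in instances if x[j] >= mid]
    left_maxs = list(maxs)
    left_maxs[j] = mid
    right_mins = list(mins)
    right_mins[j] = mid
    return {
        "dim": j,
        "mid": mid,
        "left": build_tree(left, mins, left_maxs, dims, depth + 1, max_height),
        "right": build_tree(right, right_mins, maxs, dims, depth + 1, max_height),
    }


def region_counts(tree, x):
    """Descend to the leaf region covering x and return its class counts."""
    node = tree
    while "counts" not in node:
        node = node["left"] if x[node["dim"]] < node["mid"] else node["right"]
    return node["counts"]


def mass_likelihood(tree, class_sizes, x, y):
    """Equation 16: class-y count in the region over class-y count in the sub-sample."""
    if class_sizes.get(y, 0) == 0:
        return 0.0  # assumption: absent class contributes zero likelihood
    return region_counts(tree, x)[y] / class_sizes[y]
```

Here `max_height` would be set to h × d, and `class_sizes` maps each class to its count in the sub-sample. Building descends at most hd levels with ψ instances per level, and a query follows a single root-to-leaf path.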


3.3  Proposed Bayesian classifiers versus existing Bayesian classifiers

The decision rules of the existing and the proposed Bayesian classifiers are provided in
Table 2. Note that the existing Bayesian classifiers (NB, BayesNet and AηDE) estimate
the conditional probability as the product of one-dimensional likelihoods, assuming
some kind of conditional independence. In contrast, DEMassBayes, MassBayes
and ENNBayes estimate multi-dimensional likelihoods directly.

The time and space complexities of two variants of NB, AηDE, DEMassBayes,
ENNBayes and MassBayes are presented in Table 3. Note that these complexities
do not include the additional discretisation needed in the preprocessing for
NB-Disc and AηDE. Both training time complexity and space complexity of DEMassBayes,
ENNBayes and MassBayes are independent of n.

Table 2: Decision rules of different Bayesian classifiers.

Classifier     Decision Rule                                                          Remarks
-----------    -------------------------------------------------------------------    ---------------------------------------------------
NB-KDE         argmax_y p(y) ∏_{i=1}^{d} p(x_i|y)                                      p(x_i|y) is estimated with KDE (Gaussian kernel).
NB-Disc        argmax_y p(y) ∏_{i=1}^{d} p(x_i|y)                                      p(x_i|y) is estimated through discretisation.
BayesNet       argmax_y p(π_y, y) ∏_{i=1}^{d} p(x_i|π_i, y)                            π_i = parent(x_i), π_y = parent(y); probabilities are estimated through discretisation.
AηDE           argmax_y Σ_{s ∈ S_η} p(x_s, y) ∏_{j ∈ {1,2,...,d}\s} p(x_j|x_s, y)      p(x_j|x_s, y) is estimated through discretisation.
DEMassBayes    argmax_y p(y) (1/t) Σ_{i=1}^{t} f_i(x|y)                                f_i(x|y) is estimated using Equation 11.
ENNBayes       argmax_y p(y) (1/t) Σ_{i=1}^{t} f_i(x|y)                                f_i(x|y) is estimated using Equation 15.
MassBayes      argmax_y p(y) (1/t) Σ_{g ∈ G_t} p(x_g|y)                                p(x_g|y) is estimated using Equation 17.

Table 3: Time and space complexities of the existing (NB and AηDE) and the
proposed (MassBayes and ENNBayes) Bayesian classifiers.

Classifiers     Training time       Testing time       Space
------------    ----------------    ---------------    ------------------------
NB-KDE*         O(nd)               O(cmd)             O(cmd)
NB-Disc†        O(nd)               O(cd)              O(cdv)
AηDE‡           O(n C(d, η+1))      O(cd C(d, η))      O(c C(d, η+1) v^{η+1})
DEMassBayes⋆    O(cthdφ)            O(cthd)            O(ctdφ)
MassBayes       O(thdψ)             O(thd)             O(tψ(d + c))
ENNBayes        -                   O(ctφd)            O(tψd)

* Langley and John (1995), † Webb et al. (2005), ‡ Webb et al. (2012), ⋆ Ting et al. (2013)
n: total number of training instances, m: average number of training instances in a class, d:
number of dimensions, c: number of classes, v: average number of discrete values of an
attribute, η: number of super-parents, t: number of trees, h: level of divisions, ψ: sample size
|𝒟_i|, and φ: average number of samples per class in 𝒟_i. C(a, b) denotes the binomial coefficient.

Note that the average case training and testing time complexities of DEMassBayes
and MassBayes are much better than the worst case complexities presented
in Table 3 because the tree building process terminates early once every instance
is isolated. With moderate values of d and h, the average height of the tree will be
much less than the maximum height of hd. Hence, DEMassBayes and MassBayes
run faster than ENNBayes.

4  Empirical evaluation

This section presents the results of the experiments conducted to evaluate the
performance of ENNBayes and MassBayes against well-known Bayesian classifiers.
Since our focus is on the Bayesian approach to classifier learning, we chose the
state-of-the-art Bayesian contenders: three variants of AηDE (A1DE, A2DE, A3DE),
BayesNet, two variants of NB (NB-KDE and NB-Disc) and DEMassBayes. Webb
et al. (2012) showed that A2DE is better than, or at least competitive with, the
state-of-the-art ensemble approaches of Feating (Ting et al., 2011) and Random Forest
(Breiman, 2001).

ENNBayes and MassBayes were implemented in Java using the WEKA platform
(Hall et al., 2009), which also has implementations of NB, BayesNet and
A1DE. For DEMassBayes, A2DE and A3DE, we used the WEKA implementations
provided by the respective authors.

Fifteen data sets with n > 10000 were used. All the attributes in the data
sets were numeric. The properties of the data sets are provided in Table 4.
RingCurve, Wave and OneBig are synthetic data sets and the
rest are real-world data sets from the UCI Machine Learning Repository (Frank
and Asuncion, 2010). RingCurve and Wave are subsets of the
RingCurve-Wave-TriGaussian data set used by Ting and Wells (2010) and OneBig is the data set
used by Nanopoulos et al. (2006).


Table 4: Properties of data sets used.

Data sets         Data size    Dimensions    Classes
--------------    ---------    ----------    -------
KDDCup99            5209460            32         40
CoverType            581012            10          7
YearPrediction       515345            90          2
Census               299285             7          2
SkinSegment          245057             3          2
Localisation         164860             3         11
MiniBooNE            129596            50          2
OneBig                68000            20         10
Shuttle               58000             8          7
Letters               20000            16         26
RingCurve             20000             2          2
Wave                  20000             2          2
Magic04               19020            10          2
GasSensor             13790           128          6
Pendigits             10992            16         10

All the experiments were conducted on a Linux machine with a 2.27 GHz processor
and 20 GB of memory. The performance measures were the classification accuracy
and the CPU runtime. In large data sets (n > 100000), the data sets were divided
into two folds, one for training and the other for testing. The classification accuracy
on the test set and the total runtime (training and testing) were reported.
In small data sets (n < 100000), a 10-fold cross validation was used and the average
accuracy and average runtime (training and testing) were reported.

A  two-standard-error  significance  test  was  conducted  to  check  whether  the 
difference  in  accuracies  of  two  competing  classifiers  was  significant.  A  win  or  loss 
was  counted  if  the  difference  was  significant;  otherwise,  it  was  a  draw. 
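The report does not spell out how the standard error is computed, so the following Python sketch is only one plausible reading: it assumes accuracies are proportions measured on the same number of test instances and uses the binomial approximation for the standard error of their difference. The function name is hypothetical.

```python
import math


def two_standard_error_outcome(acc_a, acc_b, n_test):
    """Return 'win', 'loss' or 'draw' for classifier A against classifier B.

    Assumption: acc_a and acc_b are accuracies in [0, 1] on n_test instances,
    and the two estimates are treated as independent binomial proportions.
    """
    se = math.sqrt(acc_a * (1 - acc_a) / n_test + acc_b * (1 - acc_b) / n_test)
    diff = acc_a - acc_b
    if abs(diff) <= 2 * se:
        return "draw"
    return "win" if diff > 0 else "loss"
```

Under this reading, a difference counts as a win or a loss only when it exceeds two standard errors; for cross-validated results, a standard error computed across folds would be a natural alternative.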

For AηDE, BayesNet and NB-Disc, data sets were discretised by a supervised
discretisation technique based on minimum entropy (Fayyad and Irani, 1995), as
suggested by the authors of AηDE, before building the classification models. The
reported runtime did not include the additional discretisation time spent
in preprocessing.

For  BayesNet,  the  parameter  ‘maximum  number  of  parents’  was  set  to  100 
to  examine  whether  a  large  number  of  parents  produces  better  results;  and  the 
parameter  ‘initialise  as  Naive  Bayes’  was  set  to  ‘false’  to  initialise  an  empty 
network  structure.  The  default  values  were  used  for  the  rest  of  the  parameters. 

In MassBayes and DEMassBayes, the parameters t, ψ and h were set by default
to 100, 5000 and 10, respectively, whereas the parameters t and ψ in ENNBayes
were set by default to 25 and 5000, respectively. Euclidean distance (L2-norm) was
used to find the nearest neighbour in ENNBayes.

We discuss the experimental results in detail in the following three subsections.
The results in terms of classification accuracy and runtime are analysed in Sections
4.1 and 4.2, respectively. We analyse the sensitivity of the parameters in ENNBayes
and MassBayes in Section 4.3.


4.1  Classification  accuracy  comparison 

A summary of the comparison is presented in Table 5, which shows the win:loss:draw
counts of MassBayes and ENNBayes against the other contenders based on the
two-standard-error significance test. Both ENNBayes and MassBayes perform significantly
better than NB, A1DE and DEMassBayes in the complete set of 15
data sets. BayesNet, A2DE and A3DE did not complete on every data set. Both
ENNBayes and MassBayes still have more wins than losses in comparison with
these classifiers. The complete results of ENNBayes, MassBayes and the other contenders
are provided in Table 7 in Appendix B.

We discuss the classification results of ENNBayes and MassBayes against
AηDE, BayesNet and NB, and DEMassBayes separately in the following three
subsections.


Table 5: Win:Loss:Draw counts of MassBayes and ENNBayes against the other
contenders in terms of accuracy based on the two-standard-error significance test.
Note that A3DE and BayesNet did not complete in five data sets and A2DE did
not complete in two data sets.

Contenders     MassBayes    ENNBayes
-----------    ---------    --------
A3DE           5:3:2        5:4:1
A2DE           7:5:1        8:2:3
A1DE           10:1:4       12:1:2
BayesNet       6:3:1        6:2:2
NB-KDE         15:0:0       13:2:0
NB-Disc        15:0:0       14:1:0
DEMassBayes    10:0:5       7:1:7

4.1.1  ENNBayes and MassBayes versus AηDE

Since AηDE is the key contender, we analyse the results in more detail in this
subsection. The accuracies of ENNBayes and MassBayes in all the data sets are
plotted against the accuracies of each of the three variants of AηDE in Figure 4.
The coordinates of each point in the plot are the accuracies of a pair of
classifiers in a data set. If both classifiers had produced the same accuracy in
a data set, the point representing that data set would lie on the diagonal. In both
plots, many points lie below the diagonal line and only a few points are above it.
This indicates that both the proposed methods are better than AηDE in many
cases.


Fig. 4: Scatter plot of accuracies of ENNBayes and MassBayes versus those of the
three variants of AηDE (A1DE, A2DE and A3DE). (a) ENNBayes versus AηDE;
(b) MassBayes versus AηDE.


ENNBayes had 12 wins and 1 loss against A1DE; 8 wins and 2 losses against
A2DE; and 5 wins and 4 losses against A3DE. Similarly, MassBayes had 10 wins
and 1 loss against A1DE; 7 wins and 5 losses against A2DE; and 5 wins and 3
losses against A3DE.

It is interesting to note that A3DE did not complete in five out of the fifteen
data sets used. In OneBig (d = 20), KDDCup99 (d = 32), MiniBooNE (d = 50),
YearPrediction (d = 90) and GasSensor (d = 128), it did not complete because of
arithmetic overflow. The memory requirement of AηDE increases with d and c
as it requires an (η+2)-dimensional probability table to store the observed frequency
for each combination of (η+1) attribute values and the class values (Webb et al.,
2012). Table 6 shows the increase in memory with the increase in the number of
dimensions and classes in three variants of AηDE. A3DE can only be used in low
dimensional data sets (d < 20). It cannot handle data sets even with a moderate
number of dimensions. Similarly, A2DE did not complete in KDDCup99 (d = 32,
c = 40) and GasSensor (d = 128, c = 6). A2DE can handle more dimensions
than A3DE but still fails in data sets with a moderate number of dimensions and
classes. A1DE, which can handle more attributes than A2DE, completed in all
data sets.
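The coefficients listed in Table 6 appear to follow a simple pattern: the per-class memory is C(d, η+1) multiplied by v^{η+1}, and the total is c times that. The short Python sketch below reproduces the coefficients under that reading; the function name is hypothetical.

```python
from math import comb


def ande_table_cells(d, c, eta):
    """Coefficients of v**(eta + 1) for per-class and total memory.

    d: number of dimensions, c: number of classes, eta: number of super-parents.
    """
    per_class = comb(d, eta + 1)
    return per_class, c * per_class
```

For example, `ande_table_cells(32, 40, 2)` gives the A2DE coefficients for KDDCup99. The combinatorial growth in d is what makes A3DE impractical beyond about 20 dimensions.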

Table 6: Memory required by three variants of AηDE in different data sets. v is
the average number of discrete values per attribute.

                              Memory per class                          Total memory
Data sets         #d    #c    A1DE       A2DE         A3DE              A1DE        A2DE          A3DE
--------------    ---   ---   --------   ----------   -------------     ---------   -----------   -------------
Census              7     2   21v^2      35v^3        35v^4             42v^2       70v^3         70v^4
Shuttle             8     7   28v^2      56v^3        70v^4             196v^2      392v^3        490v^4
CoverType          10     7   45v^2      120v^3       210v^4            315v^2      840v^3        1470v^4
Letters            16    26   120v^2     560v^3       1820v^4           3120v^2     14560v^3      47320v^4
OneBig             20    10   190v^2     1140v^3      4845v^4           1900v^2     11400v^3      48450v^4
KDDCup99           32    40   496v^2     4960v^3      35960v^4          19840v^2    198400v^3     1438400v^4
MiniBooNE          50     2   1225v^2    19600v^3     230300v^4         2450v^2     39200v^3      460600v^4
YearPrediction     90     2   4005v^2    117480v^3    2555190v^4        8010v^2     234960v^3     5110380v^4
GasSensor         128     6   8128v^2    341376v^3    10668000v^4       48768v^2    2048256v^3    64008000v^4

4.1.2  ENNBayes and MassBayes versus BayesNet and Naive Bayes


We  focus  on  the  comparison  with  the  second  key  contenders  BayesNet  and  Naive 
Bayes  in  this  subsection. 

In the scatter plots in Figure 5, most of the points are below the diagonal line.
This shows that ENNBayes and MassBayes are better than BayesNet and Naive
Bayes. ENNBayes had 6 wins and 2 losses against BayesNet; 13 wins and 2 losses
against NB-KDE; and 14 wins and 1 loss against NB-Disc. Similarly, MassBayes
had 6 wins and 3 losses against BayesNet, and won on all 15 data sets against both
variants of NB (NB-KDE and NB-Disc).

Note that BayesNet did not complete in five data sets with a moderate number
of dimensions, namely OneBig (d = 20), KDDCup99 (d = 32), MiniBooNE (d = 50),
YearPrediction (d = 90) and GasSensor (d = 128); it failed with an out-of-memory error
on a computing cluster node with 20 GB of memory. In contrast, ENNBayes and MassBayes
completed in all 15 data sets using the same machine.


Fig. 5: Scatter plot of accuracies of ENNBayes and MassBayes versus those of
BayesNet and the two variants of Naive Bayes (NB-KDE and NB-Disc).
(a) ENNBayes versus BayesNet and NB; (b) MassBayes versus BayesNet and NB.

4.1.3  MassBayes and ENNBayes versus DEMassBayes

This  subsection  focuses  on  comparison  with  DEMassBayes,  a  variant  of  the  same 
generic  ensemble  approach. 

As shown in the scatter plots in Figure 6, both ENNBayes and MassBayes
produced better classification accuracies than DEMassBayes. MassBayes had 10
wins and no loss against DEMassBayes whereas ENNBayes had 7 wins and one
loss. DEMassBayes produced results competitive with MassBayes and ENNBayes
in low dimensional data sets such as RingCurve (d = 2), Wave (d = 2),
SkinSegment (d = 3), Localisation (d = 3), Census (d = 7) and Shuttle (d = 8).
It had significantly worse results than both MassBayes and ENNBayes in data
sets with a moderate number of dimensions such as KDDCup99 (d = 32),
MiniBooNE (d = 50), YearPrediction (d = 90) and GasSensor (d = 128). When the
number of dimensions increases, the volume of the regions decreases exponentially,
which heavily influences the density estimation and degrades the performance of
DEMassBayes.


Fig.  6:  Scatter  plot  of  accuracies  of  ENNBayes  and  MassBayes  versus  those  of 
DEMassBayes. 


Fig.  7:  Scatter  plot  of  accuracies  of  ENNBayes  versus  MassBayes. 



4.1.4  ENNBayes versus MassBayes

As  shown  in  Figure  7,  ENNBayes  and  MassBayes  produced  competitive  results  in 
many  data  sets.  ENNBayes  had  5  wins,  2  losses  and  8  draws  against  MassBayes. 


4.2  Runtime  comparison 

4.2.1  ENNBayes and MassBayes versus existing Bayesian classifiers

In terms of runtime, ENNBayes was generally one to three orders of magnitude
slower than the existing Bayesian classifiers. However, MassBayes was
an order of magnitude faster than some contenders (such as A2DE and NB-KDE)
in large data sets and it was competitive with many existing Bayesian classifiers in
many data sets.

The runtimes of the proposed and the existing Bayesian classifiers in the three
largest data sets (KDDCup99, YearPrediction and CoverType) are presented in
Figure 8. In the largest data set, KDDCup99, the runtime of MassBayes was of the
same order of magnitude as DEMassBayes, A1DE and NB-KDE, whereas it was
an order of magnitude slower than NB-Disc. A2DE, A3DE and BayesNet did not
complete in KDDCup99. In YearPrediction, MassBayes was an order of magnitude
faster than A2DE and NB-KDE, and was of the same order of magnitude as
DEMassBayes and A1DE. A3DE and BayesNet did not complete in the YearPrediction
data set. In CoverType, the runtime of MassBayes was of the same order of
magnitude as DEMassBayes, A3DE, BayesNet and NB-KDE, but it was an order
of magnitude slower than A2DE, A1DE and NB-Disc. ENNBayes was one to three
orders of magnitude slower than the existing Bayesian classifiers in all three data
sets.

The complete runtime results of ENNBayes, MassBayes and the other contenders
are provided in Table 8 in Appendix B.

Note that the reported runtime results for AηDE, BayesNet and NB-Disc did
not include the discretisation time that must be spent as a preprocessing step,
which gives AηDE, BayesNet and NB-Disc an unfair advantage over the proposed
classifiers. The discretisation cost is significant in large and moderately
high dimensional data sets. For example, the discretisation took 1290 seconds
in the KDDCup99 data set, and 467 seconds in the YearPrediction data set. The
discretisation time itself was of the same order of magnitude as the runtime of
MassBayes. The runtimes of the supervised discretisation (Fayyad and Irani, 1995)
in all the data sets are provided in Table 9 in Appendix C.

The head-to-head comparison of runtime of ENNBayes, MassBayes and DEMassBayes
against the existing Bayesian classifiers (AηDE, BayesNet and NB)
is not entirely fair. The existing Bayesian classifiers assume some kind of conditional
independence and estimate simplified surrogates of p(x|y). In contrast,
ENNBayes, MassBayes and DEMassBayes estimate p(x|y) directly from the given
training data.

Among the three classifiers based on the proposed ensemble approach, ENNBayes
was one to three orders of magnitude slower than MassBayes and DEMassBayes.
MassBayes and DEMassBayes had runtimes of the same order of magnitude. Among
the three variants of AηDE, the runtime increases with η.

Fig. 8: Runtime of MassBayes, ENNBayes and the other contenders in the three
largest data sets: KDDCup99, CoverType and YearPrediction. The vertical axis is
on a logarithmic scale of base 10. For ease of reading, the classifiers are organised
into groups of three: the first group has three classifiers (MassBayes, ENNBayes
and DEMassBayes) based on the proposed ensemble approach; the second group
has three variants of AηDE (A3DE, A2DE and A1DE); and the last group has
BayesNet, NB-KDE and NB-Disc. Note that the discretisation time was not included
in the runtime of AηDE, BayesNet and NB-Disc. Bars marked with a star,
drawn at the maximum height, indicate that the classifiers did not complete the
tasks.


The above runtime results do not provide a detailed picture of the scalability
of the proposed classifiers. Hence, in the following subsection we examine
their scalability to give a more accurate idea of their time complexities.


4.2.2 Scaleup test

In order to examine how well the classifiers scale up to large data sets, we
used a subset of the KDDCup99 data set with the three largest classes (d = 32,
c = 3; see footnote 1). The training data size was increased from 10000 to
50000, 100000, half a million, one million and five million, i.e., by a factor
of 5, 10, 50, 100 and 500 over the base, respectively. The test set had 10000
instances. The increase in training time and in the space required to store the
classification model of MassBayes and the three variants of AnDE is presented
in Figure 9 (see footnote 2). The increase is presented as a ratio, with the
training time and space required for 10000 training instances as the base. Note
that AnDE had an unfair advantage over MassBayes in terms of runtime as the
discretisation time was not included in the presented results.

1 The number of classes must be reduced for AnDE to run.

2 The scaleup test was conducted against the key contender AnDE. ENNBayes was
not included in the scaleup test as there is no training phase.

Fig. 9: Increase in training time and memory requirement to learn a
classification model with the increase in training size in a subset of the
KDDCup99 data set with the three largest classes (i.e., c = 3, d = 32). The
horizontal axes are on a logarithmic scale of base 10. A3DE did not complete
when the training size was increased to five million.

With the increase in training size by a factor of 5, 10, 50, 100 and 500, the
training time of A1DE increased by a factor of 3, 6, 28, 57 and 420, followed
by A2DE (5, 11, 58, 128 and 686) and A3DE (8, 18, 72 and 161). A3DE did not
complete when the training size was increased by a factor of 500 to five
million instances. The training time of MassBayes was constant, irrespective of
the training data size.
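
As a rough way to read these factors, each (size factor, time factor) pair can
be converted into an empirical scaling exponent, log(time factor) / log(size
factor): a value near 1 suggests roughly linear growth in the training size and
a value of 0 corresponds to constant cost. The sketch below uses only the
factors quoted above; the function name scaling_exponent and the dictionary
layout are our own illustration and are not part of the original experiments.

```python
import math


def scaling_exponent(size_factor, cost_factor):
    """Exponent e such that cost_factor == size_factor ** e."""
    return math.log(cost_factor) / math.log(size_factor)


# (training-size factor, training-time factor) at the largest completed size.
# A3DE did not complete at 500x, so its 100x result is used instead.
reported = {
    "A1DE": (500, 420),
    "A2DE": (500, 686),
    "A3DE": (100, 161),
    "MassBayes": (500, 1),  # constant training time
}

for name, (size_factor, time_factor) in reported.items():
    print(f"{name}: exponent = {scaling_exponent(size_factor, time_factor):.2f}")
```

On these numbers A1DE is slightly below linear, A2DE and A3DE are slightly
above linear, and MassBayes has exponent 0.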

The memory requirement of A1DE increased by a factor of 1.4, 1.7, 3, 4 and 10,
followed by A2DE (1.7, 2.3, 5, 8 and 29) and A3DE (2.1, 3, 9 and 16). In
contrast, MassBayes had a constant memory requirement.


(a)  Training  Time 


(b)  Memory  requirement 

Fig. 10: Increase in training time and memory requirement to learn a
classification model with the increase in the number of dimensions in a subset
of the KDDCup99 data set with the three largest classes (i.e., c = 3,
n = 5125369). The horizontal and vertical axes in Figures (a) and (b) are on a
logarithmic scale of base 2 and 10, respectively. A3DE did not complete when
the number of dimensions was increased to 32.

When the training size was varied from 10000 instances to five million
instances, all the classifiers produced similar classification accuracy. The
classification accuracy of MassBayes varied from 99.96% to 99.99%, whereas that
of A1DE and A2DE varied from 99.89% to 99.98% and from 99.91% to 99.98%,
respectively. This indicates that MassBayes scales better than the existing
Bayesian classifiers in very large data sets in terms of both training time and
memory requirement, and produces marginally better classification accuracy.

In order to examine how well the classifiers scale with the number of
attributes, we increased the number of attributes of the same subset of the
KDDCup99 data set with the three largest classes (n = 5125369, c = 3) from 2 to
4, 8, 16, 24 and 32. These attributes were selected from the best attributes
identified using the Chi-squared attribute evaluation available in WEKA (Hall
et al., 2009). The test set had 10000 instances. The increase in training time
and in the space required to store the classification model of MassBayes and
the three variants of AnDE is presented in Figure 10. The bases for the runtime
ratio and memory ratio are the training time and memory required by the data
set with 2 attributes.

With the increase in the number of attributes by a factor of 2, 4, 8, 12 and
16, MassBayes increased its training time by a factor of 1.3, 1.5, 3.8, 5.9 and
6.4, respectively. The closest contender, A1DE, increased its training time by
a factor of 2.5, 4.6, 8.6, 15 and 21, followed by A2DE (2.7, 6.3, 21, 81 and
166) and A3DE (3.4, 9.4, 193 and 1830). A3DE did not complete when the number
of attributes was increased by a factor of 16 to 32.

The memory requirement of MassBayes increased by a factor of 1.9, 2.3, 3.4, 4.2
and 5, respectively. A1DE increased its memory requirement by a factor of 4,
13, 21, 31 and 35, followed by A2DE (109, 1161, 3052, 5952 and 7219) and A3DE
(868, 42491, 179230 and 463939).


4.3  Role  of  parameters  in  ENNBayes  and  MassBayes 

In this section, we present the results of a series of experiments conducted to
investigate the sensitivity of ENNBayes and MassBayes to their parameters. We
conducted the following sets of experiments in the KDDCup99 data set, varying
one parameter at a time and fixing the other parameters to their default
values.

• ENNBayes

1. Varying ensemble size (t) in the range [10, 25, 50, 100] and fixing
ψ = 5000.

2. Varying sample size (ψ) in the range [500, 1000, 5000, 10000] and fixing
t = 25.

• MassBayes

1. Varying ensemble size (t) in the range [10, 25, 50, 100] and fixing ψ = 5000
and h = 10.

2. Varying sample size (ψ) in the range [500, 1000, 5000, 10000] and fixing
t = 100 and h = 10.

3. Varying height (h) in the range [1, 5, 10, 15] and fixing t = 100 and
ψ = 5000.
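
The one-parameter-at-a-time design listed above can be written down compactly.
The following is only an illustrative sketch: the function one_at_a_time, the
dictionary names and the key psi (standing for the sample size) are our own
naming, and the defaults are simply the fixed values stated in the list.

```python
def one_at_a_time(defaults, grids):
    """Return (varied parameter, configuration) pairs.

    Each configuration changes exactly one parameter and keeps the others
    at their default values.
    """
    configs = []
    for name, values in grids.items():
        for value in values:
            config = dict(defaults)
            config[name] = value
            configs.append((name, config))
    return configs


# Fixed values and ranges as stated in the list above.
ENN_DEFAULTS = {"t": 25, "psi": 5000}
ENN_GRIDS = {"t": [10, 25, 50, 100], "psi": [500, 1000, 5000, 10000]}

MASS_DEFAULTS = {"t": 100, "psi": 5000, "h": 10}
MASS_GRIDS = {
    "t": [10, 25, 50, 100],
    "psi": [500, 1000, 5000, 10000],
    "h": [1, 5, 10, 15],
}

for varied, config in one_at_a_time(MASS_DEFAULTS, MASS_GRIDS):
    print(varied, config)
```

Each configuration would then be passed to the classifier under test; running
the classifier itself is outside the scope of this sketch.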

Figures 11 and 12 show the effect of the parameters t and ψ on the
classification accuracy and runtime of ENNBayes and MassBayes, and Figure 13
shows the effect of the parameter h on the classification accuracy and runtime
of MassBayes.

Fig. 11: The effect of ensemble size (t) on accuracy and runtime of ENNBayes
and MassBayes.

Fig. 12: The effect of sample size (ψ) on accuracy and runtime of ENNBayes and
MassBayes.

Fig. 13: The effect of height (h) on accuracy and runtime of MassBayes.




All the experiments were conducted using 10000 instances for testing and the
rest for training. The increase in runtime was plotted as a ratio to show how
the runtime grew as each parameter was increased. The bases for the runtime
ratio while varying t, ψ and h are the total runtime (including training and
testing) for t = 10, ψ = 500 and h = 1, respectively.

When t was increased from 10 to 25, 50 and 100, the accuracies of both ENNBayes
and MassBayes increased up to a certain point (t = 50) and then remained almost
constant. Similarly, when the sample size was increased from 500 to 1000, 5000
and 10000, the accuracies of ENNBayes and MassBayes increased rapidly up to
ψ = 5000 and then only gradually when ψ was increased to 10000. As shown in
Figure 13, the accuracy of MassBayes likewise increased up to a certain point
(h = 10) and then remained constant when h was increased further.

The results show that the predictive accuracy is not very sensitive to the
parameters, provided they are set to sufficiently large values. However,
setting the parameters to very large values increases the runtime. The runtime
varies linearly with t in both ENNBayes and MassBayes. With the increase in ψ,
the runtime of ENNBayes and MassBayes varies linearly and sub-linearly,
respectively. In MassBayes, if the sample size is large enough, an increase in
h increases the runtime almost linearly. After reaching a certain height, the
runtime remains constant because tree building stops early once every instance
is isolated.


5  Discussion 


Between the two proposed classifiers, ENNBayes is computationally expensive due
to the associated distance calculations. MassBayes is fast because the tree
structure speeds up the search for the local neighbourhood. In MassBayes, with
moderate d and h, the trees are usually significantly shorter than the maximum
height h × d. The distinguishing characteristics of ENNBayes and MassBayes are
summarised as follows:


1.  ENNBayes  estimates  multi-dimensional  likelihoods  using  all  the  features,  whereas 
MassBayes  estimates  using  feature  subsets  of  different  sizes. 

2.  ENNBayes  is  based  on  nearest  neighbour  and  needs  no  training;  MassBayes 
builds  trees  in  the  training  stage. 

3.  ENNBayes  estimates  likelihoods  using  density  estimation;  MassBayes  estimates 
likelihoods  using  probability  estimation. 


In ENNBayes, $f_i(x|y)$ can also be estimated by considering the volume in the
feature space that encloses the k nearest neighbours of x in
$\mathcal{D}_{i,y}$, as in Silverman (1986):

$$f_i(x|y) = \frac{|N(x, k|\mathcal{D}_{i,y})|}{|\mathcal{D}_{i,y}| \times \mathrm{volume}(N(x, k|\mathcal{D}_{i,y}))} \qquad (19)$$


This implementation does not provide a good estimate of f_i(x|y) because the
estimate is heavily influenced by the volume, which becomes very small, as in
DEMassBayes, even in data sets with a moderate number of dimensions. To
overcome this problem, we used the implementation based on distance, as used by
Breunig et al. (2000). We used the Euclidean distance (L_p-norm with p = 2) in
the experiments, but any value of p can be used.
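
To illustrate why the volume-based estimate of Equation (19) degrades as the
number of dimensions grows, the sketch below implements a plain k-nearest
neighbour density estimate. It rests on an assumption that is ours, not the
paper's: the region enclosing the k nearest neighbours is taken to be the
Euclidean ball centred at x whose radius is the distance to the k-th
neighbour. The function names ball_volume and knn_density are hypothetical,
and this is not the distance-based implementation actually used in the
experiments.

```python
import math


def ball_volume(radius, d):
    """Volume of a d-dimensional Euclidean ball of the given radius."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


def knn_density(x, data, k=1):
    """k-NN density estimate: k / (n * volume enclosing the k neighbours).

    Assumption: the enclosing region is the ball centred at x that reaches
    the k-th nearest neighbour. A query that coincides with a data point
    gives radius 0 and would need special handling.
    """
    dists = sorted(math.dist(x, point) for point in data)
    radius = dists[k - 1]
    return k / (len(data) * ball_volume(radius, len(x)))


# The same radius encloses a rapidly vanishing volume as d grows,
# so the density estimate becomes dominated by the volume term.
for d in (1, 2, 8, 32):
    print(d, ball_volume(0.5, d))
```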

It is interesting to note that we are not the first to employ only the nearest
neighbour to do density estimation. Naive Bayes Nearest Neighbour (NBNN)
(Boiman et al., 2008) and local NBNN (McCann and Lowe, 2012) employ only the
nearest neighbour in a kernel estimator to estimate the one-dimensional
likelihoods in the Naive Bayesian framework. In contrast, ENNBayes employs only
the nearest neighbour in a k-nearest neighbour density estimator to estimate
the multi-dimensional likelihoods in the more general Bayesian framework.

One obvious question with ENNBayes is: what if an ordinary kNN density
estimator is used instead of an ensemble? In large data sets, this is
infeasible because of its high time complexity. For example, when kNN with
k = 1 was used to estimate p(x|y) directly from D, it could not complete in the
largest data set (KDDCup99) in 20 days, and it is estimated to need 140 days to
complete. In contrast, ENNBayes completed the task in less than seven days.
With ψ = 5000 and t = 25, ENNBayes uses only 125000 training instances, 20
times fewer than the entire training set of 2.5 million instances, and runs 20
times faster.

ENNBayes is the first lazy Bayesian classifier that estimates p(x|y) directly
through nearest neighbour density estimation. The existing lazy Bayesian
classifiers, LBR (Zheng and Webb, 2000) and LWNB (Frank et al., 2003), estimate
one-dimensional likelihoods and use NB. LBR needs to convert continuous-valued
attributes to categorical attributes and estimates p(x_j|y) (j = 1, 2, ..., d)
through probability estimation in a local region defined by a conjunctive rule
covering x. Similarly, LWNB estimates p(x_j|y) through (i) probability
estimation in a local region covered by the k nearest neighbours if x_j is a
discrete attribute or (ii) density estimation assuming a Gaussian distribution
if x_j is a continuous-valued attribute.

In  ENNBayes,  the  nearest  neighbour  search  can  be  improved  by  using  indexing 
schemes  such  as  Cover  Trees  (Beygelzimer  et  al.,  2006),  M-Trees  (Ciaccia  et  al., 
1997)  and  Projection-Indexed  Nearest  Neighbours  (PINN)  (Vries  et  al.,  2012). 

Due to the tree-based implementation, one may view MassBayes as an ensemble of
decision trees like Bagging (Breiman, 1996), Random Forest (Breiman, 2001) and
Random Trees (Liu, 2005). However, the trees are not decision trees because the
tree building process uses neither class information nor any other evaluation
criterion. The purpose of feature space partitioning in MassBayes is different
from that in decision trees. In MassBayes, the objective is to define local
regions around the observed instances in order to estimate the likelihood
p(x|y). In decision trees, the objective is to separate the classes as well as
possible in order to estimate the class membership probability p(y|x).
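
The two quantities are linked by Bayes' rule, which is standard and not
specific to this work. Assuming a class prior $p(y)$ is also available, a
classifier that estimates the likelihood $p(x|y)$ predicts through the
posterior, whereas a decision tree estimates the posterior $p(y|x)$ directly:

$$\hat{y} = \arg\max_{y} p(y|x) = \arg\max_{y} \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_{y} p(x|y)\,p(y)$$

The last step holds because $p(x)$ does not depend on $y$.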

In terms of implementation, DEMassBayes and MassBayes are similar. The only
difference is that MassBayes constructs t trees with ψ instances each, whereas
DEMassBayes constructs t trees per class with fewer instances (ψ/c on
average).

The  performance  of  both  ENNBayes  and  MassBayes  will  be  affected  by  the 
presence  of  irrelevant  attributes.  Irrelevant  attributes  make  the  distribution  sparse 
in  the  feature  space  and  affect  the  nearest  neighbour  search  in  ENNBayes  and 
feature  space  partitioning  in  MassBayes.  Irrelevant  attributes  affect  other  Bayesian 
classifiers  in  a  similar  way.  The  easiest  solution  is  to  conduct  feature  selection 
before  building  a  classification  model. 

The density estimation tree (Ram and Gray, 2011), which estimates f(x) using a
decision tree, eliminates irrelevant attributes implicitly as they are never
selected to split the data. However, it requires an expensive search to select
an attribute and its cut-point to partition the data space, which makes it
inapplicable to large data sets due to its high time and space complexities.
Also, in high dimensional problems, the density estimate is heavily influenced
by the volume of the leaf node, which becomes very small, as in DEMassBayes.

LiNearN (Wells et al., 2012) is an ensemble approach to estimating f(x) that
overcomes the issues associated with DEMass. It defines local hypercube regions
from a sub-sample of instances based on the nearest neighbour defined by the
L∞-norm and estimates mass in the regions using another sub-sample of
instances. It is more adaptive than DEMass as it constructs regions with
different volumes based on the data distribution. LiNearN produced good results
in the unsupervised learning tasks of clustering and anomaly detection.
However, when it was used in a Bayesian classifier to estimate p(x|y), the
classification accuracy was not as impressive as that of ENNBayes and
MassBayes. This is because many unseen test instances fall outside the
hypercube regions, resulting in zero mass and zero density.

The proposed ensemble approach to estimating the multi-dimensional likelihoods
from sub-samples yields constant training time and space complexities with
respect to the training size, which makes it well suited to big data sets and
data streams. It also allows the user to trade off prediction accuracy against
computational cost as required. The computational cost (time and space) can be
reduced by setting lower values for the parameters t and ψ if a higher
misclassification rate can be tolerated. Setting the parameters to higher
values improves the predictive accuracy at the expense of increased
computational cost.


6  Conclusions 

Existing Bayesian classifiers have been designed using one-dimensional
likelihoods, based on the premise that multi-dimensional likelihoods are
difficult to estimate directly from data. We re-examine this premise and find
that there is a simple way to estimate multi-dimensional likelihoods directly
from data using an ensemble approach.

This paper presents a generic ensemble approach to estimating the
multi-dimensional likelihood in Bayesian classification learning using random
sub-samples of the training data. We show that:

1. With a small sub-sample of data, the multi-dimensional likelihood can be
estimated easily using existing data modelling techniques.

2. Aggregating estimates from multiple models provides a good estimate of
p(x|y) as it considers different local neighbourhoods of x.

3. The proposed ensemble approach reduces the training time and space
complexities to constant with respect to the number of training instances.

Using  the  generic  ensemble  approach,  we  introduce  two  new  Bayesian  classifiers 
called  ENNBayes  and  MassBayes  using  a  nearest  neighbour  density  estimator  and 
a  probability  estimator,  respectively,  to  estimate  the  multi-dimensional  likelihoods. 
Our  empirical  evaluation  shows  that  both  ENNBayes  and  MassBayes  produced 
better  classification  accuracy  than  the  state-of-the-art  Bayesian  classifiers.  Both 
the  classifiers  are  efficient  in  terms  of  runtime  and  memory  requirement,  and  they 
scale  better  than  the  existing  Bayesian  classifiers  in  very  large  data  sets. 

Between ENNBayes and MassBayes, ENNBayes has slightly better predictive
accuracy than MassBayes in some data sets, but it is one to three orders of
magnitude slower due to the need to do nearest neighbour search.


Appendix 

A  Implementation  of  MassBayes 


Algorithm 1: BuildTrees(D, t, ψ, h)

Inputs: D - input data, t - number of trees, ψ - sub-sampling size, h - level
of divisions.
Output: F - a set of t h:d-trees

    H ← h × d {Maximum height of a tree}
    Initialize F
    for i = 1 to t do
        𝒟 ← sample(D, ψ) {strictly without replacement}
        (min, max) ← InitialiseWorkSpace(𝒟)
        A ← {Randomised list of d attributes.}
        F ← F ∪ SingleTree(𝒟, min, max, 0, A)
    end for
    return F


Algorithm 2: SingleTree(𝒟, min, max, ℓ, A)

Inputs: 𝒟 - input data, min & max - arrays of minimum and maximum values for
each attribute that define a work space, A - a randomised list of d attributes,
ℓ - current height level.
Output: an h:d-tree

    Initialize Node(·)
    while (ℓ < H and |𝒟| > 1) do
        q ← nextAttribute(A, ℓ) {Retrieve an attribute from A based on height level.}
        mid_q ← (max_q + min_q) / 2
        𝒟_l ← filter(𝒟, q < mid_q)
        𝒟_r ← filter(𝒟, q ≥ mid_q)
        if (|𝒟_l| = 0) or (|𝒟_r| = 0) then {Reduce range for single-branch node.}
            if (|𝒟_l| > 0) then max_q ← mid_q
            else min_q ← mid_q
            end if
            ℓ ← ℓ + 1
            continue at the start of while loop
        end if
        {Build two nodes: Left and Right as a result of a split into two half-spaces.}
        temp ← max_q; max_q ← mid_q
        Left ← SingleTree(𝒟_l, min, max, ℓ + 1, A)
        max_q ← temp; min_q ← mid_q
        Right ← SingleTree(𝒟_r, min, max, ℓ + 1, A)
        terminate while loop
    end while
    classCount ← updateClassCount(𝒟)
    return Node(Left, Right, SplitAtt ← q, SplitValue ← mid_q, classCount)
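
For readers who prefer executable code, Algorithms 1 and 2 can be sketched in
Python as follows. This is a minimal sketch under stated assumptions rather
than the authors' implementation: InitialiseWorkSpace is assumed to return the
bounding box of the sub-sample, nextAttribute is assumed to cycle through the
randomised attribute list by height level, and the right branch is taken to
include instances equal to the mid-point. The work-space bounds are copied for
each child instead of being swapped through a temporary variable. All Python
names (Node, single_tree, build_trees, leaf_total) are our own.

```python
import random
from collections import Counter


class Node:
    """Tree node; leaves have left = right = None."""

    def __init__(self, class_count, left=None, right=None,
                 split_att=None, split_value=None):
        self.class_count = class_count
        self.left, self.right = left, right
        self.split_att, self.split_value = split_att, split_value


def single_tree(data, lo, hi, level, attrs, max_height):
    """Recursively halve the work space [lo, hi]; data is a list of (x, y)."""
    left = right = q = mid = None
    while level < max_height and len(data) > 1:
        q = attrs[level % len(attrs)]  # assumed behaviour of nextAttribute
        mid = (lo[q] + hi[q]) / 2
        d_left = [(x, y) for x, y in data if x[q] < mid]
        d_right = [(x, y) for x, y in data if x[q] >= mid]
        if not d_left or not d_right:
            # Single-branch case: shrink the range towards the occupied half.
            if d_left:
                hi[q] = mid
            else:
                lo[q] = mid
            level += 1
            continue
        # Copy the bounds for each child (replaces the temp swap).
        hi_left = hi[:]
        hi_left[q] = mid
        lo_right = lo[:]
        lo_right[q] = mid
        left = single_tree(d_left, lo[:], hi_left, level + 1, attrs, max_height)
        right = single_tree(d_right, lo_right, hi[:], level + 1, attrs, max_height)
        break
    counts = Counter(y for _, y in data)
    if left is None:  # no split happened, so this node is a leaf
        return Node(counts)
    return Node(counts, left, right, q, mid)


def build_trees(data, t, psi, h, seed=0):
    """Build t trees, each from psi instances sampled without replacement."""
    rng = random.Random(seed)
    d = len(data[0][0])
    max_height = h * d
    forest = []
    for _ in range(t):
        sample = rng.sample(data, psi)
        # Assumed work space: the bounding box of the sub-sample.
        lo = [min(x[j] for x, _ in sample) for j in range(d)]
        hi = [max(x[j] for x, _ in sample) for j in range(d)]
        attrs = list(range(d))
        rng.shuffle(attrs)
        forest.append(single_tree(sample, lo, hi, 0, attrs, max_height))
    return forest


def leaf_total(node):
    """Sum of class counts over all leaves (equals the sub-sample size)."""
    if node.left is None:
        return sum(node.class_count.values())
    return leaf_total(node.left) + leaf_total(node.right)


# Usage on synthetic two-dimensional data with two classes.
gen = random.Random(1)
data = [((gen.random(), gen.random()), i % 2) for i in range(50)]
forest = build_trees(data, t=3, psi=16, h=4)
```

Because the trees are built without looking at the class labels, the labels
are only used to record class counts at each node.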

C Discretisation Time

Table 9: Total runtime for supervised discretisation (Fayyad and Irani, 1995).
#n is the number of instances, #d the number of attributes and #c the number
of classes.

Data sets            #n         #d     #c     Disc. time (s)
KDDCup99             5209460    32     40     1290
CoverType            581012     10     7      98
YearPrediction       515345     90     2      467
Census               299285     7      2      35
SkinSegment          245057     3      2      7
Localisation         164860     3      11     9
MiniBooNE            129596     50     2      100
OneBig               68000      20     10     15
Shuttle              58000      8      7      4
LetterRecognition    20000      16     26     3
RingCurve            20000      2      2      1
Wave                 20000      2      2      1
Magic04              19020      10     2      3
GasSensor            13790      128    6      17
Pendigits            10992      16     10     2